From raw web pages to structured entities

MassIndex crawls the open web, extracts entities and relationships using ML, and assembles them into a searchable knowledge graph. Here's what that gives you.

Four entity types, fully profiled

Every entity extracted from the web gets a structured profile with relevant metadata, linked to related entities in the graph.

Companies

  • Industry & sub-industry
  • Products & services
  • Technologies & certifications
  • Key people (executives, founders)
  • Location (city, region, country)
  • Contact info & social links
  • Company size & type

People

  • Connected organizations
  • Role & title mentions
  • Co-occurrence with products
  • Location associations
  • Cross-referenced across sources

Products

  • Linked to maker/company
  • Category classification
  • Mentioned across articles
  • Related technologies
  • Market presence signals

Locations

  • Companies operating there
  • People associated with the area
  • Regional industry clusters
  • Geographic relationship mapping
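
As a rough illustration of what a structured profile could look like in code, here is a minimal sketch of a company record. The class name, field names, and sample values are assumptions made for this example, not MassIndex's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class CompanyProfile:
    """Hypothetical shape of a company profile; field names are illustrative."""
    name: str
    industry: str = ""
    sub_industry: str = ""
    products: list[str] = field(default_factory=list)
    technologies: list[str] = field(default_factory=list)
    key_people: list[str] = field(default_factory=list)  # names of linked people
    city: str = ""
    region: str = ""
    country: str = ""

    def location(self) -> str:
        # Join whichever location parts are present, most specific first.
        parts = (self.city, self.region, self.country)
        return ", ".join(part for part in parts if part)


# Made-up example data.
acme = CompanyProfile(
    name="Acme Robotics",
    industry="Manufacturing",
    sub_industry="Industrial automation",
    products=["ArmOne"],
    key_people=["Jane Doe"],
    city="Berlin",
    country="Germany",
)
print(acme.location())  # Berlin, Germany
```

People, products, and locations would presumably follow the same pattern, each with its own fields and links back to related entities.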

What you can do with it

Full-Text Entity Search

Search across three collections: companies, articles, and web pages. Filter results by industry, category, and domain. Highlighted snippets show why each result matched, and autocomplete suggests queries as you type.

Interactive Knowledge Graph

Explore relationships visually in a force-directed graph. Filter by entity type, search by name, and traverse up to two degrees of connection. Click any node to see its full properties and connections.
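
Traversing "up to two degrees of connection" can be pictured as a depth-limited breadth-first search. The sketch below assumes the graph is a plain adjacency dictionary; the function name and sample entities are hypothetical, and the real graph store may work quite differently.

```python
from collections import deque


def neighbors_within(graph: dict[str, set[str]], start: str,
                     max_depth: int = 2) -> dict[str, int]:
    """Return nodes reachable from `start` in at most `max_depth` hops, with hop counts."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = depth + 1
                queue.append(nxt)
    del seen[start]  # the starting node is not its own neighbor
    return seen


# Made-up example graph (undirected, stored as adjacency sets).
graph = {
    "Acme Robotics": {"Jane Doe", "ArmOne"},
    "Jane Doe": {"Acme Robotics", "Berlin"},
    "ArmOne": {"Acme Robotics"},
    "Berlin": {"Jane Doe", "Globex"},
    "Globex": {"Berlin"},
}
result = neighbors_within(graph, "Acme Robotics")
print(result)
```

Here "Globex" is three hops from "Acme Robotics", so it falls outside the two-degree limit and is left out of the result.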

Topic Monitoring & Alerts

Create custom topics with keywords, industry filters, and URL patterns. Assign priority boosts to focus crawling on what matters to you. Subscribe via webhook or email to get notified when new matching entities appear.
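
To make the topic model concrete, here is a minimal sketch of how a topic with keywords, industry filters, and URL patterns might be matched against a crawled page. The `Topic` class, the `matches` function, the use of shell-style glob patterns, and the rule that every non-empty filter must pass are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Topic:
    """Hypothetical topic definition; field names are illustrative."""
    name: str
    keywords: list[str] = field(default_factory=list)
    industries: list[str] = field(default_factory=list)
    url_patterns: list[str] = field(default_factory=list)  # shell-style globs (assumed)
    priority_boost: float = 1.0


def matches(topic: Topic, text: str, industry: str, url: str) -> bool:
    """A page matches when every non-empty filter on the topic is satisfied."""
    lowered = text.lower()
    if topic.keywords and not any(k.lower() in lowered for k in topic.keywords):
        return False
    if topic.industries and industry not in topic.industries:
        return False
    if topic.url_patterns and not any(fnmatchcase(url, p) for p in topic.url_patterns):
        return False
    return True


# Made-up example topic and pages.
topic = Topic(
    name="Battery tech",
    keywords=["solid-state battery"],
    industries=["Energy"],
    url_patterns=["https://example.com/blog/*"],
    priority_boost=2.0,
)
page_text = "A new Solid-State Battery design was announced."
hit = matches(topic, page_text, "Energy", "https://example.com/blog/post-1")
miss = matches(topic, page_text, "Retail", "https://example.com/blog/post-1")
print(hit, miss)  # True False
```

A matching page could then trigger the webhook or email notification, and `priority_boost` could feed into how the crawler orders its queue.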

Public Entity Pages

Every company and domain gets a server-rendered profile page that search engines can discover. Structured data in JSON-LD markup makes these pages eligible to appear in Google with rich metadata.
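
For readers unfamiliar with JSON-LD, the sketch below builds a small schema.org `Organization` description and wraps it in the script tag that search engines read. The function name and sample company are hypothetical, and the properties actually emitted on MassIndex pages may differ.

```python
import json


def company_json_ld(name: str, url: str, city: str, country: str) -> str:
    """Build a schema.org Organization description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "address": {
            "@type": "PostalAddress",
            "addressLocality": city,
            "addressCountry": country,
        },
    }
    return json.dumps(data, indent=2)


# Made-up example company.
markup = company_json_ld("Acme Robotics", "https://acme.example", "Berlin", "DE")

# JSON-LD is embedded in the page inside a script tag of this type.
# A production renderer would also escape "</" sequences in the JSON.
script_tag = f'<script type="application/ld+json">\n{markup}\n</script>'
print(script_tag)
```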

ML-Powered Extraction

DistilBERT classifiers categorize content. spaCy NER models extract people, organizations, products, and locations from every page. Models are continuously retrained as new data flows in.

Continuous Crawling

A distributed crawler indexes the web around the clock. Configurable politeness, priority queues, and topic-based crawling strategies ensure coverage where it matters. Monitor crawl progress in real time.
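
One common way to combine priority queues with politeness is a frontier that always hands out the highest-priority URL whose host is not in a cool-down period. The toy, single-process sketch below shows that idea. The `Frontier` class and its parameters are assumptions, not a description of the MassIndex crawler.

```python
import heapq
from typing import Optional
from urllib.parse import urlparse


class Frontier:
    """Toy crawl frontier: highest priority first, with a per-host minimum delay."""

    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay = delay_seconds
        self._heap: list[tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order
        self._next_allowed: dict[str, float] = {}  # host -> earliest next fetch time

    def add(self, url: str, priority: float = 1.0) -> None:
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self, now: float) -> Optional[str]:
        """Return the best URL whose host may be fetched at time `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlparse(item[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.delay
                chosen = item[2]
                break
            deferred.append(item)  # host is cooling down; try again later
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen


frontier = Frontier(delay_seconds=5.0)
frontier.add("https://a.example/1", priority=1.0)
frontier.add("https://a.example/2", priority=3.0)  # e.g. boosted by a topic
frontier.add("https://b.example/1", priority=2.0)

first = frontier.pop(now=0.0)   # highest priority overall
second = frontier.pop(now=0.0)  # a.example is cooling down, so b.example goes next
third = frontier.pop(now=0.0)   # only a.example is left and it is not yet allowed
fourth = frontier.pop(now=5.0)  # the delay has elapsed
print(first, second, third, fourth)
```

Passing `now` in explicitly keeps the example deterministic; a real crawler would read a clock and distribute this state across workers.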

See it for yourself

Search the index, explore the graph, or browse entity profiles.