Automatically extract names, companies, and dates from text

Beginner · 8 min · By Ana Brainiall

What NER solves that regex can't

Regex is great for rigid patterns: a ZIP code always has a fixed format, an email always has @. But people's names, companies, and dates have no fixed pattern:

"Pedro Silva", "Maria da Conceição dos Santos", "Dr. Fernando" — all names
"Petrobras", "Banco do Brasil", "Itaú Unibanco SA", "Loja do Seu Zé" — all companies
"January 5th", "01/05/2026", "last Friday", "next month" — all dates

Named entity recognition (NER) uses a language model that learns to read context: "the company Itaú" vs "Itaú street". Regex can't make that distinction; a well-trained NER model gets it right 95%+ of the time.
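To make the contrast concrete, here is a small sketch using only Python's standard `re` module. The ZIP code and email patterns are simplified, and `NAIVE_NAME` is a deliberately naive pattern invented for this illustration (two or more capitalized words in a row); none of them come from a particular library.

```python
import re

# Rigid patterns: regex handles these well.
CEP = re.compile(r"\b\d{5}-\d{3}\b")  # Brazilian ZIP code, e.g. 01310-100
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")  # simplified email pattern

# Naive "name" pattern (illustrative assumption): capitalized words in a row.
NAIVE_NAME = re.compile(r"\b[A-ZÀ-Ý][a-zà-ÿ]+(?: [A-ZÀ-Ý][a-zà-ÿ]+)+\b")

text = (
    "Maria da Conceição dos Santos lives on Itaú Street, 01310-100. "
    "Contact: maria@example.com"
)

ceps = CEP.findall(text)
emails = EMAIL.findall(text)
names = NAIVE_NAME.findall(text)

print(ceps)    # ['01310-100']
print(emails)  # ['maria@example.com']
print(names)   # ['Itaú Street']  (misses the person, flags the street)
```

The rigid patterns come out clean, but the name pattern misses "Maria da Conceição dos Santos" entirely (the lowercase "da" and "dos" break the run) and reports a street as a name. Patching the pattern for one case tends to break another, which is the gap a context-aware model fills.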

[Image: sample text with entities highlighted in different colors, with names in blue]

Standard and custom entities

Public NER models (spaCy, Hugging Face) typically detect:

PER (Person): Pedro Silva, Dr. João
ORG (Organization): Petrobras, Google
LOC (Location): São Paulo, Brazil
DATE: January 5th, 2026
MONEY: R$ 1,500, USD 200
TIME: 3:30 PM, 9 in the morning
PERCENT: 20%, 0.5%

For specific domains, you can train a custom model. Examples:

Legal: laws (Lei 13.709), case numbers (N° 1234567-89.2024), courts
Medical: medications, diseases (ICD-10), procedures
Financial: stock tickers, bank branches, account numbers

Brainiall offers custom models on demand on the Business plan.

How it works under the hood (in 30 seconds)

1. Tokenization: text is broken into words and punctuation
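2. POS tagging: each word receives a grammatical class (noun, verb...); classic pipelines run this as an explicit step, while transformer models usually learn it implicitly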
3. Contextualization: each word is converted into a vector of 768+ dimensions considering its neighbors
4. BIO classification: each token is tagged as Begin-entity, Inside-entity, or Outside. E.g.: "Pedro" (B-PER) "Silva" (I-PER) "works" (O) "at" (O) "Petrobras" (B-ORG)
5. Aggregation: consecutive B+I tokens become a single entity
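Steps 4 and 5 are easy to see in code. The sketch below is a minimal, library-free version of the aggregation step; `aggregate_bio` is a name made up for this example, and it assumes the model has already produced one BIO tag per token. Real libraries ship their own aggregation logic, which also handles subword pieces.

```python
def aggregate_bio(tokens, tags):
    """Merge BIO-tagged tokens into (text, type) entities."""
    entities = []
    current = None  # ([tokens so far], entity_type) for the entity being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins: close the open one, if any.
            if current:
                entities.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            # Continuation of the open entity of the same type.
            current[0].append(token)
        else:
            # "O", or an I- tag with no matching open entity.
            if current:
                entities.append(current)
            current = None

    if current:
        entities.append(current)
    return [(" ".join(toks), etype) for toks, etype in entities]


tokens = ["Pedro", "Silva", "works", "at", "Petrobras"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(aggregate_bio(tokens, tags))
# [('Pedro Silva', 'PER'), ('Petrobras', 'ORG')]
```

Dropping an I- tag that has no matching B- before it is one possible design choice; some implementations prefer to treat it as the start of a new entity.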

Modern models (mBERT, XLM-R, multilingual DeBERTa) run this pipeline in ~10–50ms for a paragraph.

Practical use cases

CRM enrichment: extract companies and contacts from emails to update your database
News analysis: monitor mentions of your brand, competitors, and executives in the media
Compliance: find personal names in documents for data privacy audits
Research: extract authors, citations, and dates from academic papers at scale
Legal analysis: identify parties in a case, cited laws, and judgment dates
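As an illustration of the compliance case, here is a minimal redaction sketch. It assumes entities arrive as dictionaries with `text`, `type`, `start`, and `end` keys, where `start` and `end` are character offsets with an exclusive end; `redact` is a hypothetical helper, not part of any API.

```python
def redact(text, entities, types=("PER",)):
    """Replace each entity of the given types with a [TYPE] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["type"] in types:
            text = text[:ent["start"]] + f"[{ent['type']}]" + text[ent["end"]:]
    return text


text = "Pedro Silva signed the contract with Petrobras."
entities = [  # hand-written sample in the assumed format
    {"text": "Pedro Silva", "type": "PER", "start": 0, "end": 11},
    {"text": "Petrobras", "type": "ORG", "start": 37, "end": 46},
]

print(redact(text, entities))
# [PER] signed the contract with Petrobras.
print(redact(text, entities, types=("PER", "ORG")))
# [PER] signed the contract with [ORG].
```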

Specific limitations for Portuguese

Compound names with prepositions: "Maria dos Santos" — some models split it into "Maria" + "Santos" as two separate entities
Family businesses without a legal suffix: "Padaria do Zé" may be treated as a description rather than an entity
Nicknames: "Lula" as a person vs "lula" as the word for squid — case sensitivity varies
Brazilian addresses: Street + name + number + ZIP code — segmentation can go wrong
Acronyms: Is "USP" an entity or just a word?

Tip: for borderline cases, always manually review 100 examples before going to production.
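The compound-name problem can often be softened in post-processing. The sketch below merges two adjacent PER entities when the only text between them is a name connector. The connector list and the `merge_split_names` helper are assumptions made for this example, and the entity format (`text`, `type`, `start`, `end`, with an exclusive end offset) is assumed as well.

```python
import re

# Portuguese name connectors (assumed list; extend as needed).
CONNECTOR = re.compile(r"\s+(?:de|da|do|das|dos)\s+", re.IGNORECASE)


def merge_split_names(text, entities):
    """Merge adjacent PER entities separated only by a name connector."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev
            and prev["type"] == "PER"
            and ent["type"] == "PER"
            and CONNECTOR.fullmatch(text[prev["end"]:ent["start"]])
        ):
            # Extend the previous entity to cover connector + this entity.
            merged[-1] = {
                "text": text[prev["start"]:ent["end"]],
                "type": "PER",
                "start": prev["start"],
                "end": ent["end"],
            }
        else:
            merged.append(dict(ent))
    return merged


text = "Maria dos Santos met Pedro Silva."
entities = [  # hand-written sample showing the split described above
    {"text": "Maria", "type": "PER", "start": 0, "end": 5},
    {"text": "Santos", "type": "PER", "start": 10, "end": 16},
    {"text": "Pedro Silva", "type": "PER", "start": 21, "end": 32},
]
print([e["text"] for e in merge_split_names(text, entities)])
# ['Maria dos Santos', 'Pedro Silva']
```

A rule like this can also merge two different people by mistake, so it belongs in the same manual review as the rest of the borderline cases.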

Integrating via API

A single endpoint returns an array of entities:

```python
import httpx

# Send the text to the NER endpoint; replace brnl-xxx with your API key.
r = httpx.post(
    "https://api.brainiall.com/api/nlp/ner",
    json={"text": "Pedro Silva, from Petrobras, announced on January 5th."},
    headers={"Authorization": "Bearer brnl-xxx"},
)
r.raise_for_status()  # fail loudly on HTTP errors
print(r.json())
# [{"text": "Pedro Silva", "type": "PER", "start": 0, "end": 11},
#  {"text": "Petrobras", "type": "ORG", "start": 18, "end": 27},
#  {"text": "January 5th", "type": "DATE", "start": 42, "end": 53}]
```
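Once you have the array, a common next step is to group entities by type. The sketch below does that and also checks each entity's offsets against the original text. It assumes `start` and `end` are Python string indices with an exclusive end, which matches the first entity in the example above but is not confirmed here for every case. `group_by_type` is a hypothetical helper and the sample data is hand-written rather than a real API response.

```python
from collections import defaultdict


def group_by_type(text, entities):
    """Group entity strings by type, skipping entities whose offsets don't match."""
    groups = defaultdict(list)
    for ent in entities:
        if text[ent["start"]:ent["end"]] != ent["text"]:
            continue  # offsets disagree with the text; skip rather than trust them
        groups[ent["type"]].append(ent["text"])
    return dict(groups)


text = "Ana Costa joined Google in São Paulo."
entities = [  # hand-written sample in the assumed response format
    {"text": "Ana Costa", "type": "PER", "start": 0, "end": 9},
    {"text": "Google", "type": "ORG", "start": 17, "end": 23},
    {"text": "São Paulo", "type": "LOC", "start": 27, "end": 36},
]
print(group_by_type(text, entities))
# {'PER': ['Ana Costa'], 'ORG': ['Google'], 'LOC': ['São Paulo']}
```

The offset check is cheap insurance: if the text was trimmed or re-encoded between the request and your own processing, mismatches show up here instead of further down the line.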

Try it right now

Ask "extract people, companies, and dates from this text: [paste]" in the Brainiall chat. Or use the API at /api/nlp/ner. The Pro plan at $29 includes 10k requests/month; Business adds batch processing and custom models.