What this url extractor pulls from raw text
The extractor matches http, https, ftp, mailto, and bare www. patterns inside any input you give it. That covers 95% of real-world links found in server logs, customer support emails, exported chat threads, and pasted articles. Bare www.example.com references get normalized to https://www.example.com before validation so they survive the parsing step. Trailing punctuation that often glues itself to URLs in prose (closing parentheses, periods, commas, semicolons, and square brackets) gets stripped before the link is stored.
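A minimal sketch of that matching step makes the behavior concrete. The pattern and the findCandidates name are illustrative assumptions, not the tool's exact internals:

```javascript
// Illustrative sketch, not the tool's exact regex: match URL-ish spans,
// strip trailing prose punctuation, and normalize bare www. references.
const URL_PATTERN = /\b(?:(?:https?|ftp):\/\/|mailto:|www\.)[^\s<>"']+/gi;

function findCandidates(text) {
  return (text.match(URL_PATTERN) ?? []).map((raw) => {
    const trimmed = raw.replace(/[).,;\]]+$/, ""); // drop glued-on ")." "," ";" "]"
    return trimmed.startsWith("www.") ? `https://${trimmed}` : trimmed;
  });
}
```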
After extraction, every candidate runs through new URL(). Anything that fails parsing gets dropped, which catches malformed strings like https:// with no host. (Note that new URL() itself accepts a host without a TLD, so http://example parses successfully; rejecting those takes a separate check.) The clean list is then deduplicated and sorted by host so you can see which domains appear most often. A 4,000-line nginx access log with 12,000 raw matches typically reduces to 80-200 unique URLs after dedupe.
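A sketch of that cleanup pass, under the same assumptions as above (cleanAndGroup is a hypothetical name):

```javascript
// Validate with new URL(), dedupe, then group by host. Sketch only.
function cleanAndGroup(candidates) {
  const unique = new Set();
  for (const candidate of candidates) {
    try {
      unique.add(new URL(candidate).href); // throws on e.g. "https://" with no host
    } catch {
      // malformed candidate: drop it
    }
  }
  const byHost = new Map();
  for (const href of unique) {
    const host = new URL(href).host; // empty string for mailto: links
    byHost.set(host, [...(byHost.get(host) ?? []), href]);
  }
  return new Map([...byHost].sort(([a], [b]) => a.localeCompare(b)));
}
```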
How to use this url extractor
- Enter URL or paste text/HTML. Drop in a server log, JSON payload, email export, prose paragraph, or raw HTML source. You can also enter a URL in the field above and click Fetch to load that page's HTML directly.
- Choose Filter. Select All URLs to see everything, HTTP/HTTPS only to skip mailto and ftp matches, or Hide same-domain to suppress links pointing to the source page's own host. The filter logic is sketched after this list.
- Hit Extract URLs. The tool returns a deduped list grouped by host, with a count next to each domain and a copy/download option for the clean output.
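For readers who want the filter semantics spelled out, here is a minimal sketch. The mode strings and the applyFilter/sourceHost names are assumptions for illustration:

```javascript
// Sketch of the three filter modes; urls are assumed already validated.
function applyFilter(urls, mode, sourceHost) {
  if (mode === "http-https-only") {
    return urls.filter((u) => /^https?:/i.test(u)); // hides mailto: and ftp: matches
  }
  if (mode === "hide-same-domain") {
    return urls.filter((u) => new URL(u).host !== sourceHost);
  }
  return urls; // "all-urls": no filtering
}
```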
Try this with a Stripe webhook log. Paste 200 lines containing references like https://api.stripe.com/v1/charges/ch_3O2, https://yourapp.com/webhooks/stripe, and mailto:[email protected]. Set Filter to HTTP/HTTPS only. The extractor returns two unique URLs grouped under two hosts: api.stripe.com (1 link) and yourapp.com (1 link). The mailto match is hidden. Toggle Filter back to All URLs and the mailto reappears in its own group.
Why a text-first url extractor beats DOM parsing for messy input
Most link extractors require valid HTML and parse the DOM to find <a href> tags. That works for clean web pages but breaks on log files, JSON, plain text, and broken HTML where anchors are missing or malformed. A regex-based url extractor scans the raw text and finds anything that looks like a URL, regardless of structure. That makes it the right tool for security audits, support ticket triage, content migrations, and any scenario where the input is not a rendered web page.
If you need anchor text, nofollow flags, or internal-versus-external classification from a real web page, use the link-extractor instead. That tool fetches the URL, parses the DOM with DOMParser, and returns structured anchor metadata. Reach for the url extractor when the input is unstructured. Reach for the link extractor when the input is HTML and you want the anchor context.
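As a rough illustration of the difference, a DOM-based pass looks something like this. The extractAnchors name and the returned fields are assumptions, not the link-extractor's documented output:

```javascript
// Sketch of a DOM-based pass: it needs parseable HTML and a base URL to
// resolve relative hrefs, which is exactly what logs and JSON payloads lack.
function extractAnchors(html, baseUrl) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return [...doc.querySelectorAll("a[href]")].map((a) => {
    const href = new URL(a.getAttribute("href"), baseUrl).href;
    return {
      href,
      text: a.textContent.trim(),               // anchor text
      nofollow: a.relList.contains("nofollow"), // rel flag
      internal: new URL(href).host === new URL(baseUrl).host,
    };
  });
}
```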
Common mistakes
- Confusing the url extractor with the link extractor. The url extractor reads any text and runs regex. The link extractor parses HTML and reads <a> tags. Use the url extractor for logs, JSON, and prose. Use the link-extractor for analyzing the link structure of a live web page.
- Forgetting to strip query parameters before deduping for analytics. Two URLs that differ only by ?utm_source= count as separate entries. If you want unique pages, run the output through a quick spreadsheet pass to drop query strings before counting, or use a small script like the sketch after this list.
- Pasting truncated HTML and missing half the links. Copy-paste from "View Page Source" sometimes truncates at 100KB or cuts mid-tag. Use the URL fetch option in the field above instead, which pulls the full server response.
- Trusting the dedupe count as a unique-page count. Dedupe compares host plus path plus query, so example.com/page and example.com/page/ are two entries even though they resolve to the same page. Normalize trailing slashes if exact uniqueness matters.
- Running the extractor on a 50MB log file in one paste. Browser regex on huge inputs can lock the tab for 30+ seconds. Split logs into 5MB chunks or use the fetch mode for individual URLs.
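A small sketch covering both normalization mistakes above. normalizePage and extractedUrls are hypothetical names for illustration:

```javascript
// Collapse ?utm_source= variants and trailing-slash twins into one entry per page.
function normalizePage(href) {
  const url = new URL(href);
  url.search = ""; // drop query parameters
  url.hash = "";
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1); // treat /page and /page/ as one entry
  }
  return url.href;
}

const uniquePages = [...new Set(extractedUrls.map(normalizePage))];
```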
Advanced tips
- After extraction, pipe the output into the link-extractor by URL to get anchor text and rel attributes for each link. The two tools chain naturally for full link audits.
- For SEO migrations, extract every URL from your old sitemap.xml, run the same extraction on the new sitemap, and diff the lists to find missing pages (a diff sketch follows this list). A 5,000-URL site usually has 50-200 missing redirects after a redesign.
- Use the Hide same-domain filter when auditing outbound links from a single page. A B2B marketing site with 80 internal links and 12 external partner links shows only the 12 externals, making the partner audit instant.
- Pair this with the email-extractor when processing contact dumps. The email extractor catches addresses the url extractor's mailto regex misses, like raw [email protected] strings without the mailto: prefix.
- For developer workflows, run extraction on git log output to find every external reference in commit messages. A 2,000-commit history typically contains 40-80 unique reference URLs (Jira tickets, GitHub PRs, Slack threads).
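Here is the diff sketch referenced in the SEO migration tip. It compares by path, which assumes the host stayed the same across the redesign; oldUrls and newUrls stand in for the extractor's output on each sitemap:

```javascript
// Old paths with no counterpart in the new sitemap are redirect candidates.
function findMissingRedirects(oldUrls, newUrls) {
  const live = new Set(newUrls.map((u) => new URL(u).pathname));
  return oldUrls.filter((u) => !live.has(new URL(u).pathname));
}
```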
Once you have a clean URL list, the next step depends on your goal. For HTML link auditing with anchor context, run each URL through the link-extractor. For finding contact info in the same input, use the email-extractor and phone-number-extractor on the same paste. For sitemap-based crawls, the sitemap-checker gives you the structured XML view that the regex approach skips.