
URL Extractor

Extract every URL from any text, log file, email, or webpage in one click.

A URL extractor scans any block of text, log file, JSON dump, email body, or raw HTML and pulls out every link it finds. This tool runs a regex over whatever you paste, validates each match with the browser's URL parser, strips trailing punctuation like commas and periods, dedupes the results, and groups them by host. Filter the output to all URLs, HTTP/HTTPS only, or hide same-domain links. No signup, no upload, everything stays in your browser.


What this URL extractor pulls from raw text

The extractor matches http, https, ftp, mailto, and bare www. patterns inside any input you give it. That covers 95% of real-world links found in server logs, customer support emails, exported chat threads, and pasted articles. Bare www.example.com references get normalized to https://www.example.com before validation so they survive the parsing step. Trailing punctuation that often glues itself to URLs in prose, characters such as ")", ".", ",", ";", and "]", gets stripped before the link is stored.

After extraction, every candidate runs through new URL(). Anything that fails parsing gets dropped. This catches malformed strings like https:// with no host or fragments with no usable scheme. The clean list is then deduplicated and sorted by host so you can see which domains appear most often. A 4,000-line nginx access log with 12,000 raw matches typically reduces to 80-200 unique URLs after dedupe.
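Here is a minimal JavaScript sketch of that pipeline, assuming a simplified pattern and strip list (the live tool's exact regex may differ):

```js
// A sketch of the extraction pipeline: match, strip, normalize,
// validate, dedupe, group by host. Not the tool's exact source.
function extractUrls(text) {
  const pattern = /(?:https?|ftp):\/\/[^\s<>"']+|mailto:[^\s<>"']+|\bwww\.[^\s<>"']+/g;
  const seen = new Set();
  const byHost = new Map();
  for (const match of text.matchAll(pattern)) {
    // Strip punctuation that prose glues onto the end of a URL.
    let candidate = match[0].replace(/[).,;\]>"']+$/, '');
    // Normalize bare www. references so the parser accepts them.
    if (candidate.startsWith('www.')) candidate = 'https://' + candidate;
    let url;
    try {
      url = new URL(candidate); // drops malformed candidates
    } catch {
      continue;
    }
    if (seen.has(url.href)) continue; // dedupe on the full string
    seen.add(url.href);
    const host = url.host || url.protocol; // mailto: URLs have no host
    if (!byHost.has(host)) byHost.set(host, []);
    byHost.get(host).push(url.href);
  }
  return byHost; // Map of host -> unique URLs
}
```

Calling extractUrls(logText) on a pasted log returns a Map keyed by host, which is all the grouping step needs.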

How to use this URL extractor

  1. Enter URL or paste text/HTML. Drop in a server log, JSON payload, email export, prose paragraph, or raw HTML source. You can also enter a URL in the field above and click Fetch to load that page's HTML directly.
  2. Choose Filter. Select All URLs to see everything, HTTP/HTTPS only to skip mailto and ftp matches, or Hide same-domain to suppress links pointing to the source page's own host.
  3. Hit Extract URLs. The tool returns a deduped list grouped by host, with a count next to each domain and a copy/download option for the clean output.

Try this with a Stripe webhook log. Paste 200 lines containing references like https://api.stripe.com/v1/charges/ch_3O2, https://yourapp.com/webhooks/stripe, and mailto:[email protected]. Set Filter to HTTP/HTTPS only. The extractor finds three unique URLs but shows two, grouped under two hosts: api.stripe.com (1 link) and yourapp.com (1 link). The mailto match is hidden. Toggle Filter back to All URLs and the mailto reappears in its own group.

Why a text-first URL extractor beats DOM parsing for messy input

Most link extractors require valid HTML and parse the DOM to find <a href> tags. That works for clean web pages but breaks on log files, JSON, plain text, and broken HTML where anchors are missing or malformed. A regex-based URL extractor reads the raw text and finds anything that looks like a URL, regardless of structure. That makes it the right tool for security audits, support ticket triage, content migrations, and any scenario where the input is not a rendered web page.

If you need anchor text, nofollow flags, or internal-versus-external classification from a real web page, use the link-extractor instead. That tool fetches the URL, parses the DOM with DOMParser, and returns structured anchor metadata. Reach for the URL extractor when the input is unstructured. Reach for the link extractor when the input is HTML and you want the anchor context.

Common mistakes

  • Confusing the URL extractor with the link extractor. The URL extractor reads any text and runs regex. The link extractor parses HTML and reads <a> tags. Use the URL extractor for logs, JSON, and prose. Use the link-extractor for analyzing the link structure of a live web page.
  • Forgetting to strip query parameters before deduping for analytics. Two URLs that differ only by ?utm_source= count as separate entries. If you want unique pages, run the output through a quick spreadsheet pass to drop query strings before counting, or use the short script after this list.
  • Pasting truncated HTML and missing half the links. Copy-paste from "View Page Source" sometimes truncates at 100KB or cuts mid-tag. Use the URL fetch option in the field above instead, which pulls the full server response.
  • Trusting the dedupe count as a unique-page count. Dedupe compares full URL strings, so example.com/page and example.com/page/ are two entries even though they resolve to the same page. Normalize trailing slashes if exact uniqueness matters.
  • Running the extractor on a 50MB log file in one paste. Browser regex on huge inputs can lock the tab for 30+ seconds. Split logs into 5MB chunks or use the fetch mode for individual URLs.
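For the query-parameter pitfall above, a short browser-console snippet can replace the spreadsheet pass. A sketch, assuming the copied output has one URL per line:

```js
// Drop query strings and fragments before counting unique pages.
const raw = `https://example.com/page?utm_source=email
https://example.com/page?utm_source=social`;

const uniquePages = new Set(
  raw.split('\n').filter(Boolean).map(line => {
    const u = new URL(line.trim());
    u.search = ''; // remove ?utm_source= and friends
    u.hash = '';
    return u.href;
  })
);
console.log(uniquePages.size); // 1 instead of 2
```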

Advanced tips

  • After extraction, run each unique URL through the link-extractor to get anchor text and rel attributes for each link. The two tools chain naturally for full link audits.
  • For SEO migrations, extract every URL from your old sitemap.xml, run the same extraction on the new sitemap, and diff the lists to find missing pages (see the sketch after this list). A 5,000-URL site usually has 50-200 missing redirects after a redesign.
  • Use the Hide same-domain filter when auditing outbound links from a single page. A B2B marketing site with 80 internal links and 12 external partner links shows only the 12 externals, making the partner audit instant.
  • Pair this with the email-extractor when processing contact dumps. The email extractor catches addresses the URL extractor's mailto regex misses, like raw [email protected] strings without the mailto: prefix.
  • For developer workflows, run extraction on git log output to find every external reference in commit messages. A 2,000-commit history typically contains 40-80 unique reference URLs (Jira tickets, GitHub PRs, Slack threads).
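For the sitemap diff in the migration tip above, a small sketch (the example.com URLs are placeholders for your own extracted lists) finds the pages that still need redirects:

```js
// Diff two extracted sitemap lists, one URL per line each.
const oldList = `https://example.com/pricing
https://example.com/legacy-page`;
const newList = `https://example.com/pricing`;

const kept = new Set(newList.split('\n').map(s => s.trim()).filter(Boolean));
const missing = oldList.split('\n')
  .map(s => s.trim())
  .filter(u => u && !kept.has(u));
console.log(missing); // ['https://example.com/legacy-page'] -> needs a redirect
```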

Once you have a clean URL list, the next step depends on your goal. For HTML link auditing with anchor context, run each URL through the link-extractor. For finding contact info in the same input, use the email-extractor and phone-number-extractor on the same paste. For sitemap-based crawls, the sitemap-checker gives you the structured XML view that the regex approach skips.


Frequently Asked Questions

What is a URL extractor?

A URL extractor is a tool that scans any text input and pulls out every URL it contains. It uses a regular expression to find patterns starting with http, https, ftp, mailto, or www., then validates each match against the browser's URL parser to drop malformed strings. After validation, the tool deduplicates the list and groups results by host. This BlazeHive extractor handles 4,000-line log files, JSON payloads up to a few megabytes, raw HTML, and pasted prose without uploading anything to a server. Everything runs in the browser. Common use cases include security log triage, support ticket review, content migration audits, and quick competitive research on outbound links from a competitor article. Use the link-extractor when you specifically need anchor text and nofollow flags from a live web page instead.

How do you extract URLs from text?

Paste your text into the URL or paste text/HTML field, choose a filter, and click Extract URLs. The tool runs a regex pattern over your input that matches http://, https://, ftp://, mailto:, and bare www. strings. Trailing punctuation like commas, periods, parentheses, and brackets gets stripped from each match, which then runs through new URL() to filter out malformed candidates. The deduped list groups by host and shows a count for each domain. A 200-line email thread typically yields 5-15 unique URLs in under a second. For programmatic extraction, the same regex pattern works in Python (re.findall), JavaScript (String.prototype.matchAll), or grep with the -oE flag, but a browser tool gives you instant filtering and host grouping that a one-liner does not.
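For reference, here is a minimal matchAll version of that flow using the simplified http/https-only pattern; an illustration, not the tool's exact code:

```js
// One-liner extraction with the simplified pattern.
const text = 'See https://example.com/docs and http://example.org.';
const urls = [...text.matchAll(/https?:\/\/[^\s<>"]+/g)]
  .map(m => m[0].replace(/[).,;\]>"']+$/, '')); // strip glued punctuation
console.log(urls); // ['https://example.com/docs', 'http://example.org']
```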

Can I extract URLs from a log file?

Yes. Server logs are one of the most common inputs for this URL extractor. Paste up to a few megabytes of nginx, Apache, or application log lines into the textarea and click Extract URLs. The tool finds every URL referenced in the log, dedupes, and groups by host. A 4,000-line nginx access log with 12,000 raw matches typically reduces to 80-200 unique URLs after dedupe. For files larger than 5MB, split the log into chunks before pasting. If you need to extract URLs from logs as part of an automated workflow, the same regex pattern works in grep -oE on the command line. Use the browser tool when you want quick visual grouping. Use grep when you want to pipe the output into another script. After extraction, run suspicious external hosts through the link-extractor to inspect the actual page content.

How do you extract URLs from a PDF?

The URL extractor works on text, so first convert the PDF to text. On macOS, run pdftotext input.pdf output.txt from the command line (install via brew install poppler). On Windows or Linux, use the same pdftotext utility from the Poppler package. Open the resulting text file, copy the contents, and paste into the URL or paste text/HTML field. PDFs created from web pages typically contain 20-100 URLs. PDFs from native documents like reports usually contain 5-20. The extractor catches URLs that span line breaks if the PDF preserves them as continuous strings. Watch for PDFs that wrap long URLs across multiple lines with hyphens. Those become two broken URLs. If your output looks suspicious, open the PDF in Preview or Adobe Reader and use the built-in "Save as Text" export, which preserves URL continuity better than command-line tools for some PDFs.

How do you extract URLs from an email?

Open the email, view the raw source or "Show Original" (Gmail) or "View Source" (Outlook), and copy the full HTML body. Paste that into the URL or paste text/HTML field and click Extract URLs. A typical marketing email contains 15-40 URLs including tracking pixels, unsubscribe links, social icons, and content links. Choose the HTTP/HTTPS only filter to skip the mailto: reply-to address. For Apple Mail or Thunderbird, you can copy the rendered email directly without viewing source, but you will miss the URLs hidden in image sources and tracking pixels. View source for the most complete extraction. If you process emails in bulk, export your inbox to mbox format and run the file through the extractor in chunks. The same regex catches URLs in plain text and HTML email bodies equally well.

How does the URL extractor validate URLs?

Every regex match runs through JavaScript's new URL() constructor. If the constructor throws, the candidate gets dropped. This filters out malformed strings like https:// with no host and partial fragments with no usable scheme. The validator does not check whether the URL resolves or returns a 200 response. It only verifies syntactic validity. To check whether URLs are live, run the extracted list through the http-status-checker one host at a time. The validation step typically rejects 5-10% of raw regex matches as malformed. Bare www.example.com references get normalized with an https:// prefix before validation so they pass the parser. Trailing punctuation like ")", ".", ",", and "]" gets stripped before validation to prevent false rejections from URLs glued to surrounding prose.
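A sketch of that validation step, including the www. normalization described above:

```js
// Normalize bare www. first, then let the URL constructor
// accept or reject the candidate.
function validate(candidate) {
  const normalized = candidate.startsWith('www.')
    ? 'https://' + candidate
    : candidate;
  try {
    return new URL(normalized).href; // syntactically valid
  } catch {
    return null; // malformed, dropped from the output
  }
}
console.log(validate('www.example.com')); // 'https://www.example.com/'
console.log(validate('https://'));        // null
```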

What protocols does the URL extractor support?

The extractor matches five protocol patterns: http://, https://, ftp://, mailto:, and bare www. (which gets normalized to https://www.). That covers 95% of URLs found in real-world inputs. The extractor does not match tel:, sms:, file://, chrome-extension://, or other browser-specific schemes by design. If you need to extract phone numbers from text that mixes URLs and tel: links, use the phone-number-extractor on the same input. The HTTP/HTTPS only filter restricts output to web URLs and drops mailto: and ftp:// matches. The All URLs filter shows everything. The 5-protocol scope keeps the regex pattern fast enough to process multi-megabyte inputs in under a second on modern browsers. Custom protocols can be added by editing the regex pattern in the source if you fork the tool.

How do you deduplicate URLs?

Dedupe runs automatically after extraction. The tool compares each URL by full string match including protocol, host, path, query, and fragment. Two URLs that differ by even one character count as separate entries. That means https://example.com/page and https://example.com/page/ are two entries (trailing slash difference). It also means example.com/page?utm_source=email and example.com/page?utm_source=social are two entries. If you want unique pages regardless of query parameters, paste the output into a spreadsheet, split on ?, and dedupe the path-only column. A 12,000-match raw extraction from server logs typically dedupes to 80-200 unique URLs, a 60-150x reduction. The dedupe step is what makes the extractor useful for log triage. Without dedupe, you get the raw match count, which is rarely the number you want.
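If you want page-level dedupe without the spreadsheet pass, a short sketch (an illustration of the normalization idea, not the tool's behavior) drops queries, fragments, and trailing slashes before comparing:

```js
// Dedupe by page rather than by exact string.
function pageKey(href) {
  const u = new URL(href);
  u.search = '';
  u.hash = '';
  u.pathname = u.pathname.replace(/\/$/, '') || '/';
  return u.href;
}
const urls = [
  'https://example.com/page?utm_source=email',
  'https://example.com/page/',
  'https://example.com/page',
];
console.log(new Set(urls.map(pageKey)).size); // 1, not 3
```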

How do I count URLs in text?

Paste the text, click Extract URLs, and the tool shows the unique URL count next to the host group totals. For raw match count before dedupe, look at the result summary which displays both numbers. A 1,000-word blog article typically contains 5-20 URLs. A 10-page legal document contains 30-80 URLs. A server log file contains 50-500 unique URLs depending on traffic patterns. If you need just the count without seeing the URLs, this tool still gives you the fastest answer. For programmatic counts in scripts, use grep -oE 'https?://[^ ]+' input.txt | sort -u | wc -l on the command line. The browser tool wins on inputs you already have in your clipboard. The command-line approach wins for batch processing 100+ files.

What is the difference between a URL extractor and a link extractor?

A URL extractor reads raw text with regex and pulls out anything that looks like a URL. It works on logs, JSON, prose, broken HTML, and any unstructured input. A link extractor parses valid HTML with DOMParser and reads <a href> tags, returning structured metadata like anchor text, nofollow flags, and internal-versus-external classification. The URL extractor is text-first. The link-extractor is HTML-aware. Use the URL extractor when your input is messy or you do not have a live URL. Use the link extractor when you want to audit the link structure of a real web page with full anchor context. The two tools complement each other: extract URLs from a log, then run each unique URL through the link extractor to see what each page links to. Together they cover both unstructured text and structured HTML inputs.

Can the URL extractor classify internal vs external links?

The URL extractor does not classify links by default because it has no concept of a base URL when you paste raw text. The Hide same-domain filter does the next-best thing: when you fetch a URL using the field above, the tool knows the source host and can hide URLs pointing to that same host. That gives you an external-only view. For pasted text without a fetched source, every URL counts as external because there is no base URL to compare against. If you need true internal-versus-external classification with anchor text, use the link-extractor. It fetches the URL, parses the HTML, and labels each <a> tag as internal or external based on the source page's host. A typical B2B blog post has 80% internal links and 20% external. A news article reverses that ratio.
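The same-host comparison behind that filter is easy to reproduce. A sketch, with a hypothetical fetched source page and extracted list:

```js
// Keep only URLs whose host differs from the fetched page's host.
const sourceHost = new URL('https://example.com/blog/post').host;
const urls = [
  'https://example.com/pricing',
  'https://partner.example.net/case-study',
];
const externalOnly = urls.filter(u => new URL(u).host !== sourceHost);
console.log(externalOnly); // ['https://partner.example.net/case-study']
```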

How do I extract URLs from HTML?

Paste the raw HTML source into the URL or paste text/HTML field and click Extract URLs. The regex catches URLs inside <a href>, <img src>, <link href>, <script src>, <iframe src>, inline CSS url() calls, and plain-text URLs in the HTML body. A 50KB HTML page typically contains 80-200 unique URLs across all those sources. Use the HTTP/HTTPS only filter to focus on web URLs and skip mailto: and ftp:// matches. For HTML where you specifically want only the anchor links and need anchor text, use the link-extractor instead. It parses the DOM and returns only <a href> matches with text content. The URL extractor is broader. The link extractor is more precise for anchor-specific audits. Pick based on whether you want every URL or only the anchor links.
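To see why the regex catches attribute URLs without a DOM, run the simplified pattern over a scrap of HTML; the quote character excluded by the character class ends each match at the attribute boundary:

```js
// Attribute URLs fall out of a plain-text match, no parser needed.
const html = '<a href="https://example.com/a">x</a>' +
             '<img src="https://cdn.example.com/i.png">';
const found = [...html.matchAll(/https?:\/\/[^\s<>"]+/g)].map(m => m[0]);
console.log(found); // ['https://example.com/a', 'https://cdn.example.com/i.png']
```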

What regex pattern does the URL extractor use?

The pattern matches the five supported protocol prefixes followed by valid URL characters: roughly (?:https?|ftp):\/\/[^\s<>"]+, plus separate matches for mailto: addresses (mailto takes no slashes after the colon) and bare www. strings. The exact pattern handles edge cases like URLs ending in punctuation, URLs with parentheses, and URLs containing query parameters with special characters. After matching, each candidate runs through new URL() validation, so the regex can be slightly permissive without polluting the output. If you want to use the same pattern outside the browser, the simplified version https?:\/\/[^\s<>"]+ works in Python re.findall and grep -oE. The full pattern with mailto and www support requires more careful escaping. For most ad-hoc extractions, the simplified pattern catches 95% of real URLs. The browser tool uses the full version for completeness.
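An illustrative version of that three-alternative pattern (an approximation, not the tool's exact source):

```js
// http/https/ftp URLs, mailto: addresses, and bare www. strings
// as three regex alternatives.
const fullPattern = /(?:https?|ftp):\/\/[^\s<>"]+|mailto:[^\s<>"]+|\bwww\.[^\s<>"]+/g;
const sample = 'Visit www.example.com or ftp://files.example.com/pub.';
console.log([...sample.matchAll(fullPattern)].map(m => m[0]));
// ['www.example.com', 'ftp://files.example.com/pub.']
// The trailing period survives here; a later pass strips it.
```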

Is the URL extractor free?

Yes. This URL extractor is free with no signup, no upload, and no rate limit. Everything runs in your browser. You can paste up to a few megabytes of text and extract URLs in under a second. The tool does not send your input to a server, which matters when processing logs or emails that contain sensitive data. Compare to paid tools like Screaming Frog (250 free URLs, then $259/year for unlimited) or Ahrefs (no free URL extraction, requires $99/month subscription). Those tools include features the BlazeHive extractor does not, like recursive crawling and SEO scoring. For pure URL extraction from text or HTML, the free browser tool covers the use case. Pair it with the email-extractor and phone-number-extractor for full contact-data extraction from the same input, all free.

Why does the URL extractor strip trailing punctuation?

Because URLs in prose often have punctuation glued to the end. A sentence like "Read the article at https://example.com/post." has the period attached to the URL after a naive regex match. Without stripping, the extracted URL becomes https://example.com/post. which still parses but points at a path that does not exist. The extractor strips ")", ".", ",", ";", "]", ">", double quotes, and single quotes from the end of each match before validation. That recovers the clean URL https://example.com/post. The same logic handles URLs in markdown like [link](https://example.com) where the closing parenthesis is not part of the URL. Without trailing punctuation stripping, 10-20% of URLs extracted from prose would come out subtly broken. The strip step is what makes the extractor reliable on real-world text instead of just clean log files.
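The strip step in isolation, as a sketch matching the character list above:

```js
// Trim closing punctuation and quotes from the end of a raw match.
const stripTrailing = (s) => s.replace(/[).,;\]>"']+$/, '');
console.log(stripTrailing('https://example.com/post.')); // https://example.com/post
console.log(stripTrailing('https://example.com)'));      // https://example.com
```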
