You open a website, and it immediately starts reaching out to Google, AWS, various trackers, and unknown hosts, running scripts that load other scripts… Want to know what’s really going on? You can figure it out in 5 minutes using PowerShell and Chrome DevTools.

That’s exactly what we’ll talk about in this note.

How to use this?

Let’s start with the end result and see what pscdp and PSGraph can do.

Suppose we want to see how a particular website works, or more precisely, where it connects while loading. To do this, you first need to launch Chrome or Chromium in headless mode with the options it needs to start listening for CDP connections:

We talked about CDP in the previous post

chromium --headless --disable-gpu --remote-debugging-port=9222 --remote-allow-origins=ws://127.0.0.1:9222
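Before going further, you can check that the browser really is listening. The /json/version endpoint is a standard DevTools HTTP endpoint on the same port, not part of pscdp:

# Optional sanity check: the DevTools HTTP endpoint should answer on port 9222
Invoke-RestMethod -Uri "http://127.0.0.1:9222/json/version"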

Now you can load the pscdp and PSGraph modules and start experimenting.

Let’s collect some data by running our mini-crawler:

Import-Module "PSQuickGraph"
Import-Module "pscdp"
$res = Start-Crawling -Url "https://azazello.darkcity.dev" -MaxDegreeOfParallelism 10

The MaxDegreeOfParallelism option limits the number of browser tabs, i.e., the number of pages processed simultaneously.

The command connects to the browser on port 9222, opens a page, extracts the links from that page, and follows each of them in its own tab. It waits a while for the page to load, processes it, closes the tab, and repeats. For each page it opens, it saves the links found there and records all network events the browser reports via CDP while the page loads and its scripts execute. As a result, we get every page of the site we could reach by following links, plus the network events the browser generated along the way.

Additionally, the object returned by the command contains a graph suitable for visualizing the relationships between pages.

PS /tmp> $res | fl

Graph          : PSGraph.Model.PsBidirectionalGraph
CapturedEvents : {ProcessedUrlData { page = CrawlTarget { Url = https://azazello.darkcity.dev/azure-networking-d3js, Depth = 0 }, events = System.Collections.Generic.List`1[BaristaLabs.ChromeDevTools.Runtime.IEvent], links =
                 System.Collections.Generic.List`1[System.String] }, ProcessedUrlData { page = CrawlTarget { Url = https://azazello.darkcity.dev/graphs-windows-firewall, Depth = 0 }, events =
                 System.Collections.Generic.List`1[BaristaLabs.ChromeDevTools.Runtime.IEvent], links = System.Collections.Generic.List`1[System.String] }, ProcessedUrlData { page = CrawlTarget { Url = https://azazello.darkcity.dev/nmap-smtp-user-enum, Depth
                 = 0 }, events = System.Collections.Generic.List`1[BaristaLabs.ChromeDevTools.Runtime.IEvent], links = System.Collections.Generic.List`1[System.String] }, ProcessedUrlData { page = CrawlTarget { Url =
                 https://azazello.darkcity.dev/powershell-graph-sysmon, Depth = 0 }, events = System.Collections.Generic.List`1[BaristaLabs.ChromeDevTools.Runtime.IEvent], links = System.Collections.Generic.List`1[System.String] }…}

We can visualize this graph using PSGraph. The following command generates an HTML file with a force-directed visualization based on Vega:

Export-Graph -Graph $res.Graph -Format Vega_ForceDirected -Path "/tmp/force.html"

The result looks something like this:

Gray nodes are events, blue nodes are level-zero nodes (the site’s own pages), and orange nodes are links we found on pages but haven’t visited yet. Node colors change depending on the crawl depth you choose. By default, the command tries to crawl the entire site within the root domain. Don’t set the depth too high, or you’ll end up with a huge number of nodes. I never go past the first level ;).
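I won’t guess at the exact name of the depth parameter here; if you want to tune it, just ask the cmdlet what it accepts:

# List the parameters Start-Crawling exposes and look for the depth-related one
(Get-Command Start-Crawling).Parameters.Keys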

Events show which network requests the website generates. As the graph shows, the site doesn’t have that many external links, but there are far more events: loaded JavaScript modules, images, and other external dependencies.
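If you prefer numbers to dots, you can also group the captured events by their .NET type name. This is plain PowerShell and assumes nothing about pscdp beyond the events list we already saw:

# Count the kinds of CDP events captured for the first page,
# grouped by the .NET type name of each event object
$res.CapturedEvents[0].events |
    Group-Object { $_.GetType().Name } |
    Sort-Object Count -Descending |
    Select-Object Count, Name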

All this can also be seen in tabular form. For example, this command returns the list of visited pages:

$res.CapturedEvents.Page

and this one gives the number of events and links per page, using PowerShell’s calculated-property syntax:

$res.CapturedEvents | Select-Object @{l="pageUrl"; e = {$_.Page.url} }, @{l="eventsCount"; e = {$_.events.count}}, @{l="linksCount"; e={$_.links.count}}

PowerShell as a Data Analysis Tool

Now let’s do another neat trick. The previous command counts all links on a page, but they can point both inside the site and outside it. Internal links aren’t that interesting to us, so let’s use the power of PowerShell and count only the ones that lead outside. Just for fun. For example, like this:

$res.CapturedEvents | 
    Select-Object @{l="pageUrl"; e = {$_.Page.url} }, 
                  @{l="eventsCount"; e = {$_.events.count}}, 
                  @{l="externalLinks"; e={ ($_.links | ? { $_ -notlike "https://azazello.darkcity.dev/*" }).count  }},
                  @{l="totalLinks"; e={ ($_.links).count  }}

Here we add a calculated column that filters out links starting with https://azazello.darkcity.dev/ (our root domain) and returns the count of what remains. The result looks something like this:

Where else does the site connect? You can easily see that with the following command. After a bit of processing and removing duplicates, the result looks like this:

$res.CapturedEvents.events.Request

As we can see, there aren’t many external links either. Most are internal.
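If you’re curious, that “bit of processing” can be as simple as the snippet below. I’m assuming here that each Request object exposes a Url property, as CDP’s Network.Request does:

# Reduce the captured requests to a sorted list of unique external hosts
$res.CapturedEvents.events.Request.Url |
    ForEach-Object { ([uri]$_).Host } |
    Where-Object { $_ -and $_ -ne "azazello.darkcity.dev" } |
    Sort-Object -Unique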

Thus, we get a list of domains the site connects to. All that’s left is to find their addresses and add them to the firewall. :)
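Here’s a rough sketch of that last step, using plain .NET DNS resolution so it works in PowerShell on any platform; the host list comes from the previous snippet:

# Resolve each collected external host to its IP addresses
$externalHosts = $res.CapturedEvents.events.Request.Url |
    ForEach-Object { ([uri]$_).Host } |
    Where-Object { $_ -and $_ -ne "azazello.darkcity.dev" } |
    Sort-Object -Unique

foreach ($h in $externalHosts) {
    try {
        [System.Net.Dns]::GetHostAddresses($h) |
            ForEach-Object { [pscustomobject]@{ Host = $h; Address = $_.ToString() } }
    } catch {
        Write-Warning "Could not resolve $h"
    }
}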

Want more? Add geo-IP resolution and find out which countries your traffic goes to. Or feed the list of domains to a whois analyzer and find unexpected owners. But that’s for next time.