Scraping WordPress REST API in interactive mode

I updated wp-json-scraper to add an interactive mode.

A long time ago, I discovered that WordPress exposes a verbose REST API that makes it possible to retrieve most of a site's public data without having to enumerate it.
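To give an idea of what "verbose" means here, the following is a minimal sketch of how the URL of one page of a core collection (posts, users, media, and so on) can be built. It assumes the site serves the API under the default /wp-json/ prefix; the helper name collection_url is made up for this illustration and is not part of the tool.

```python
from urllib.parse import urlencode


def collection_url(target, collection, page=1, per_page=10):
    """Build the URL of one page of a core WordPress REST collection.

    Assumption: the API is reachable under the default /wp-json/ prefix.
    """
    base = target.rstrip("/") + "/wp-json/wp/v2/"
    # page and per_page are the standard pagination parameters of the API.
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base}{collection}?{query}"


print(collection_url("https://example.com", "posts"))
# https://example.com/wp-json/wp/v2/posts?page=1&per_page=10
```

Requesting such a URL on a site that has not restricted the API returns a JSON array describing the items, which is what makes enumeration unnecessary.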

I wrote a tool based on this API to scrape the data automatically, mostly to explore the API itself and to make the discovery process more convenient during assessments.

I recently updated this tool to integrate a new interactive mode.

Why an interactive mode

Using this tool quite often, I quickly ran into several problems. Specifying every parameter on a single command line is not pretty and is error-prone.

The obvious solution was to split the work into separate commands, which a dedicated interactive mode makes possible. This staged workflow also allows the discovery of a WordPress instance to be carried out in distinct steps.

This opens up many new possibilities:

  • Fine control over the data scraped (how much, at which time)
  • Caching of already scraped data (to limit requests to the server)
  • Granular commands with limited command length (as opposed to the bulky command-line mode)
  • etc.

How?

Starting in interactive mode is straightforward:

./WPJsonScraper.py --interactive https://example.com

You can specify each option by using the various commands. Global options (like credentials or cookies) are automatically fed into the commands typed into interactive mode.

First of all, you may want to use the show all and set commands to configure authentication, cookies or the target.
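The idea of global options that every later command picks up can be sketched with Python's standard cmd module. This is only an illustration of the pattern, not the tool's actual code: the class name ScraperShell and the option names target, cookie and auth are assumptions chosen for the example.

```python
import cmd


class ScraperShell(cmd.Cmd):
    """Hypothetical shell: global options live on the instance,
    so any command defined on the class can read them."""

    prompt = "> "

    def __init__(self, target, **kwargs):
        super().__init__(**kwargs)
        # Option names are assumptions made for this sketch.
        self.options = {"target": target, "cookie": None, "auth": None}

    def do_set(self, line):
        """set NAME VALUE: change a global option."""
        name, _, value = line.partition(" ")
        if name not in self.options or not value.strip():
            print("usage: set NAME VALUE", file=self.stdout)
            return
        self.options[name] = value.strip()

    def do_show(self, line):
        """show all | show NAME: display global options."""
        wanted = line.strip()
        names = list(self.options) if wanted in ("", "all") else [wanted]
        for name in names:
            print(f"{name} = {self.options.get(name)}", file=self.stdout)

    def do_exit(self, line):
        """exit: leave the shell."""
        return True


if __name__ == "__main__":
    ScraperShell("https://example.com").cmdloop()
```

Keeping the options on the shell instance is what lets a command typed later reuse the credentials or cookies set earlier without repeating them.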

Cookies created by the server are stored and reused in a single session until you change the target.
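One way to get this behaviour, shown here as a sketch with the standard library only, is to tie a cookie jar to the current target and replace it whenever the target changes. The class name TargetSession and its methods are hypothetical and do not describe how the tool is implemented.

```python
import http.cookiejar
import urllib.request


class TargetSession:
    """Hypothetical session: one cookie jar per target."""

    def __init__(self, target):
        self.set_target(target)

    def set_target(self, target):
        # Changing the target starts over with an empty cookie jar.
        self.target = target.rstrip("/")
        self.jar = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar)
        )

    def get(self, path):
        # Cookies set by the server are stored in self.jar and sent
        # back automatically on later requests to the same target.
        with self.opener.open(self.target + path) as response:
            return response.read()
```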

When you're happy with these global settings, you can scrape data using the list command. It will get the requested data for you.

This includes all the data that could already be scraped in past releases, such as posts, media, comments, pages, etc.

You also have some control over the list command. You can choose to export detailed information using the --json flag or essential information using the --csv flag. You can also limit the number of returned results by using the --start and --limit flags to specify the start page and the maximum number of results you want.
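To make the relation between a start page and a maximum number of results concrete, here is a small sketch of how such a request could be split into pages. The function plan_pages and the default page size of 10 are assumptions for the example; the tool may well proceed differently.

```python
def plan_pages(start_page, limit, per_page=10):
    """Return (page, count) pairs covering at most `limit` results,
    starting at `start_page`. `count` is the number of items to keep
    from that page."""
    plan = []
    page, remaining = start_page, limit
    while remaining > 0:
        count = min(per_page, remaining)
        plan.append((page, count))
        page += 1
        remaining -= count
    return plan


print(plan_pages(start_page=2, limit=25))
# [(2, 10), (3, 10), (4, 5)]
```

A real implementation would also stop early when the server runs out of pages.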

When you see content of interest, you can use the fetch command to display detailed information about it in your terminal. If you used the list command before and it returned the content you are fetching, no request will be made to the server (unless you explicitly ask for it). The content will be retrieved from the cache held in your local memory.
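The cache-first lookup described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: ContentCache, its methods and the force parameter are names invented for the example, and the remote lookup is passed in as a plain callable.

```python
class ContentCache:
    """Hypothetical in-memory cache keyed by (kind, id)."""

    def __init__(self, fetch_remote):
        # fetch_remote is any callable(kind, item_id) returning a dict.
        self._fetch_remote = fetch_remote
        self._items = {}

    def store(self, kind, items):
        """Remember items returned by a previous listing."""
        for item in items:
            self._items[(kind, item["id"])] = item

    def fetch(self, kind, item_id, force=False):
        """Return the item from the cache, asking the server only when
        it is missing or when a refresh is explicitly requested."""
        key = (kind, item_id)
        if force or key not in self._items:
            self._items[key] = self._fetch_remote(kind, item_id)
        return self._items[key]
```

With this shape, a listing calls store() on its results, and a later fetch() for one of those IDs never reaches the network unless force=True is passed.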

In the particular case of media, the dl command will allow you to download the specified media (by ID, from the cache, which is experimental, or all the media content). Note that in some cases this may mean downloading from an external source (a CDN, for example).
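Whether a download would leave the target can be checked before fetching anything. The sketch below compares the host of a media item's source_url field (the field in which the WordPress REST API reports the file location) with the host of the target; the helper name is_external is an assumption for this example.

```python
from urllib.parse import urlsplit


def is_external(target, source_url):
    """Return True when the media file is hosted on another host
    than the target, for instance on a CDN."""
    return urlsplit(source_url).hostname != urlsplit(target).hostname


print(is_external("https://example.com", "https://cdn.example.net/a.png"))
# True
```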

All the commands are documented in the GitHub repository. Inside the tool, the help command gives a brief overview and the -h flag on a command gives detailed help.

Other considerations

This update also provides some improvements under the hood:

  • Improved caching capabilities
  • Reduced code duplication
  • Various bug fixes (also fixed in 0.4.1 but included in the release packages for 0.5)

I also added the download option to the main command line.

Conclusion

There is still room for improvement. The interactive mode is still quite unstable right now. I'll work on some issues and improve its reliability over the next few months.

So far, I have mostly overlooked the authenticated part of the WordPress REST API. That includes post revisions, themes, available blocks, etc.

These should be addressed in a future release.

On the other hand, despite the interactive mode, I don't plan to turn this tool into a full-featured WordPress client (for example, to write posts from the command line). The main goal is still to identify information leaks through the API.