

Diskover SDK and API Guide

Software Development Toolkit and Application Programming Interface

This guide is intended for Service Professionals and System Administrators.

Introduction

Overview

Diskover Data is a web-based platform that provides single-pane viewing of distributed digital assets. It provides point-in-time snapshot indexes of data fragmented across cloud and on-premises storage spread across an entire organization, so users can quickly and easily search across company files. Diskover is a data management application for your digital filing cabinet, providing powerful granular search capabilities, analytics, and file-based workflow automation, ultimately enabling companies to scale their business and reduce their operating costs.

For more information, please visit diskoverdata.com

Approved AWS Technology Partner

Diskover Data is an official AWS Technology Partner. Please note that AWS has renamed Amazon Elasticsearch Service to Amazon OpenSearch Service. Most operating and configuration details for OpenSearch Service should also be applicable to Elasticsearch.


Document Conventions

TOOL PURPOSE
Copy/Paste Icon for Code Snippets: Throughout this document, all code snippets can easily be copied to the clipboard using the copy icon on the far right of the code block.

🔴 Proposed action items
⚠️ Important notes and warnings
Features Categorization

IMPORTANT
  • Diskover features and plans were repackaged as of January 2025.
  • Please refer to Diskover's solutions page for more details.
  • You can also consult our detailed list of core features.
  • Contact us to discuss your use cases, size your environment, and determine which plan is best suited for your needs.
  • Throughout this guide, you'll find labels indicating the plan(s) to which a feature belongs.

Core Features
Industry Add-Ons: These labels will only appear when a feature is exclusive to a specific industry.

Develop Your Own Python File Action Plugins

    

Introduction

This chapter describes how to create a Python File Action as a Flask Blueprint in the Diskover Admin App.

The Diskover Admin app is a secondary web interface designed to incorporate new features into Diskover. Written in Python using the Flask framework and served by uvicorn, it runs behind nginx acting as a reverse proxy, taking over the /diskover_admin route of the main Diskover web application.

The app is modular, allowing separate development and registration of components. The main app.py sets up the shared environment for all components, including logging, base templates, static files, error handling, Celery, database, and Elasticsearch connections.
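
To make that registration model concrete, below is a minimal sketch of how a modular Flask app can wire blueprints together using nested blueprints. The names used here (create_app, the import path of the example blueprint) are illustrative assumptions, not the actual contents of app.py.

# Illustrative sketch only; the real app.py also sets up logging, templates,
# error handling, Celery, database, and Elasticsearch connections.
from flask import Flask, Blueprint

def create_app():
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_pyfile('config.py')  # loads instance/config.py

    # Parent blueprint that groups all fileactions under /fileactions...
    fileactions = Blueprint('fileactions', __name__, url_prefix='/fileactions')

    # ...onto which each fileaction blueprint is registered (nested blueprints).
    from diskover_admin.routers.fileactions.example.views import blueprint as example_bp
    fileactions.register_blueprint(example_bp)

    app.register_blueprint(fileactions)
    return app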


Structure

When developing, consider the project's file and directory structure, as each area plays a role in the overall app.

Project's File or Directory Purpose
diskover-admin/ Project dir
diskover_admin/ Source code for the Diskover Admin app
etc/ Installation files and miscellaneous items
instance/ Data unique to this instance, including the main configuration file
log/ Log files
run/ Socket for nginx communication
scripts/ Project-wide scripts
sessions/ On-disk session data for each user
wsgi.py Script for starting the Diskover Admin app with uvicorn

Most important to us are the instance/ and diskover_admin/ directories. Let's take a look at the Flask application in the diskover_admin directory first.


Source Code

diskover_admin/ Directory Purpose
diskover_admin/ Source code for the Diskover Admin app
common/ Contains libraries and modules that are used in many places throughout the app
routers/ Each module/directory at this level constitutes a route of the main app and children directories/modules are routes themselves. Here we can visualize what a URL in the app would look like by looking at the directory structure. An example route inferred by the project structure: diskover_admin/routers/fileactions/liveview/views.py will contain routes for the liveview fileaction at http://yourserver/diskover_admin/fileactions/liveview/
static/ Holds shared static files that are common throughout all components, such as JavaScript and CSS files
templates/ Project-wide templates used to ensure all components have base functionality and look the same.

The Instance Directory

The instance directory contains configuration specific to your installation. Project-wide configs are stored in and loaded from /diskover-admin/instance/config.py. Setting up this file should be one of the first things done after installation. If it is not configured properly for your environment, specific components, or even the whole app, might not load.

You should find a config.sample.py file here that you can rename to config.py and modify to suit your needs.
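
As an illustration only, a pared-down instance/config.py might look something like the sketch below. The variable names are assumptions based on this guide; config.sample.py is the authoritative reference for the names your version actually supports.

# instance/config.py -- illustrative sketch only; copy config.sample.py and
# consult it for the real variable names used by your version.
SECRET_KEY = 'change-me'  # session signing key (assumed name)

# Components (fileactions) to load; step 3 of "How to Register Fileactions"
# below refers to adding your fileaction's name to a list like this.
COMPONENTS = [
    'example',
]

# Connection settings for Elasticsearch and the Celery broker
# (names assumed for illustration).
ES_HOST = 'localhost'
ES_PORT = 9200
CELERY_BROKER_URL = 'amqp://guest:guest@localhost:5672//'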


A Fileaction Example

Just like other parts of the app, each fileaction directory follows a common pattern. It's so rigid that you can often begin the development of a new fileaction by copying another and renaming the different parts.

Each fileaction is a Flask blueprint that attaches to the main app at a specific route and serves all the routes under that.

Directory or File Purpose
diskover-admin/diskover_admin/fileactions/ The root of the fileactions route
example/ An example fileaction directory
static/ Static directory with files only visible to this fileaction
templates/ Template files only visible to this fileaction
__init__.py Needed to set up the directory as a project and point to the views
config.yaml Optional config variables visible only to this fileaction
views.py The main file that defines the routes and how each is processed

The Anatomy of a Fileaction

Let's look at the "example" fileaction piece by piece and call out some important aspects of how it works. The views.py file contains all the routes that can be called and the logic behind them. It either renders a new page with the given template or returns JSON data that can be consumed by an AJAX call.

Besides imports, the blueprint is the first part of a fileaction you will see. This defines the Flask blueprint for this action: the route the action will be served on (/example) and the names of the static and template dirs.

Note that because the parent blueprint for all fileactions is served under /fileactions, this will actually be served at /fileactions/example under the main /diskover_admin route. You can find more info about Flask blueprints in the Flask documentation: Flask Blueprints

blueprint = Blueprint(
    'example', __name__,
    static_folder='static',
    template_folder='templates',
    url_prefix='/example'
)

Next, we define some global variables that are available to all templates in the fileaction. These will be rendered in the nav bar on the left to identify them.

blueprint.context_processor(lambda: {'app_name': 'Fileaction Example', 'version': str(VERSION)})

Next, we load config options from the optional config.yaml in the same directory into a config dictionary:

parent_dir = os.path.dirname(__file__)
config = parse_config_yaml(os.path.join(parent_dir, 'config.yaml'))

Finally, we get to the first of two view functions that expose a page of the fileaction. This is the default (index) page that renders when you first visit.

@blueprint.route('/', methods=['POST'])
@get_es_files
def index():
    worker = session.get('worker')
    sources = request.args.getlist('files')
    context = {
        'worker': worker,
        'sources': [s.to_dict() for s in sources],
        'config': config
    }
    return render_template('example.html', **context)

Let's take this step by step:

1) First we use the @blueprint.route decorator to register a route for this view. This route will accept POST requests at /. Because we are using nested blueprints, the actual route will be /diskover_admin/fileactions/example/. Most of the route is built by the parent blueprints, and the slash at the end corresponds to this specific route. Remember that routes are always relative to their parent, so when we define / on a blueprint with a route of /example we are really creating the route at /example/.

2) @get_es_files is a special decorator that converts the Elasticsearch doc and index that are passed by diskover-web into SimpleFileInfo objects that correspond to a path on the filesystem and what type of item it is (file or directory). This needs to be present on the index function in every fileaction, or else you won't have the paths of the objects you intend to work on. It also sets the worker in the session based on the Elasticsearch index; this tells us which worker with access to the selected files we should send tasks to.

3) Sources are a list of SimpleFileInfo objects that describe the selected files. See more at diskover-admin/diskover_admin/common/util.py.

4) We create the context dictionary to encapsulate all the variables we want passed to and available in the template.

5) When we return render_template(), we instruct Flask to render the given template using the context variables we just set. We will get to templating in a bit.

If you were to run this now, with the provided example.html, you would see that it renders a web page at /diskover_admin/fileactions/example/ with the variables we passed in from the context in different parts of the page. Notice the Submit button at the bottom of the page. We are going to use jQuery and AJAX in our JS file to bind the submit action of this button to a JavaScript function that will submit the fileaction and wait for a response. Take a look at the included JavaScript file at example/static/js/example.js.

We are not going to get into details about the JavaScript, but in general, it calls fileactions/example/submit/ with the values of the form we built in example.html, including the sources we previously selected. This call is made asynchronously, so we start and stop a spinner in the upper right corner of the navbar, and register a few callbacks to execute when it receives a response. When the response is received, it will flash a message at the top of the screen, show the hidden output container, and write the response of the fileaction in the output window.

That covers the front end, but we still need to look at the submit view to see how it is called and how it handles the fileaction. Below is the submit view from example/views.py that handles calling the fileaction on the worker.

@blueprint.route('/submit', methods=['POST'])
def submit():
    worker = session.get('worker')
    if worker is None:
        msg = 'Worker not set in session'
        flash(msg, 'error')
        logger.error(msg)
        return jsonify({'data':None, 'error': msg}), 500

    data = request.form.to_dict()
    sources = json.loads(data['sources'])
    logger.info(
        f"Submitting echo task to {worker} "
        f"for sources: {[s['path'] for s in sources]}"
    )

    task = current_app.extensions['celery'].send_task(
        'echo',
        args=(sources, ),
        queue=f'fileactions.{worker}',
        exchange='fileactions'
    )
    return redirect(url_for('tasks.get_result_sync', task_id=task.id))

Again, let's break it down:

1) First, we register the /submit route on our example blueprint.

2) Next, we get the worker that should execute this task from our session and verify that it is valid.

3) Then we take the data from the form that was sent and load the JSON string for sources into a list of dictionaries.

4) Next, we create a task with the name echo, pass it the sources list, and send it to the queue of the worker that has access to the files.

5) Finally, we redirect to the tasks route with the task.id of the task we just submitted, telling the app to wait for the result and return it.

The last step there is a bit confusing, so a little more detail is in order. First of all, this might be the first time you are seeing the url_for() construct. This is the mechanism Flask uses internally to build URLs for other routes in the application. In this case, url_for() generates a URL like tasks/result_sync/<task_id>, and the redirect causes that route to be executed.

The tasks route has a few helpers for handling Celery tasks. The synchronous version that we are using here waits for the result and then returns a JSON-encoded string with the result returned by the fileaction. The AJAX function then receives this result and displays it on the web page.
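
For reference, a synchronous result helper of this kind could be sketched roughly as below using Celery's AsyncResult API. This is an assumption about how tasks.get_result_sync might behave, not its actual source; the route name and timeout are illustrative.

# Rough sketch of a synchronous task-result view (illustrative; not the
# actual tasks router shipped with the Diskover Admin app).
from celery.result import AsyncResult
from flask import Blueprint, current_app, jsonify

blueprint = Blueprint('tasks', __name__, url_prefix='/tasks')

@blueprint.route('/result_sync/<task_id>')
def get_result_sync(task_id):
    celery_app = current_app.extensions['celery']
    result = AsyncResult(task_id, app=celery_app)
    try:
        # Block until the worker returns the fileaction's result dictionary.
        payload = result.get(timeout=60)
    except Exception as exc:
        return jsonify({'data': None, 'error': str(exc)}), 500
    return jsonify({'data': payload, 'error': None})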


Template Structure

In each of our fileaction templates, we follow a few simple guidelines to ensure they all have the same format. Here is a simplified version of the example.html template to show the guidelines.

{% extends 'base.html' %}
{% block header %}
    {% include 'example_header.html' %}
{% endblock %}
{% include 'example_nav.html' %}
{% block content %}

    ALL OF THE CONTENT GOES HERE!

{% endblock %}
{% block footer %}
    {% include 'example_footer.html' %}
{% endblock %}

In the first line, we are "extending" the base.html template. This template defines the basic layout and includes project-level CSS and JavaScript files. It helps make this fileaction look the same as all the rest. Next, we include the example_header.html template, which has info that will go in the header of all views in this fileaction. Usually, we include CSS files or styles here. Next is the most important part, the content block. This is where we put all of the HTML and templating that should be used to handle our fileaction.

In the example case, we create a form and an output container. Finally, we include the example_footer.html, which should include links to any specific JavaScript files we want in this fileaction. We use another url_for() here to point it to the example/static/js/example.js file. Once included, we can call functions from that JavaScript file.


How to Register Fileactions

During the development of a fileaction, you will want to register it in the main app so that you can navigate to the routes during testing and select the fileaction from within diskover-web. There are four steps:

1) Duplicate an existing fileaction .php file in /var/www/diskover-web/public/fileactions, give the copy the name of the new fileaction, and modify the form action to point to the base URL of this new action. Generally, you just need to change the name component to the name of your fileaction:

<form id="myForm" action="../diskover_admin/fileactions/example/" method="post">

2) Add a section in the Constants.php pointing to the new fileaction.php file.

3) In the instance/config.py file, add a COMPONENT to the list with the name of your fileaction.

4) Restart the diskover-admin service.


Celery Tasks

One unique feature of the Diskover Admin app, and the fileactions in particular, is the ability of the web application to execute tasks on remote systems. The Diskover Admin app has no direct visibility to the files that were scanned. Instead, the indexer machines that scanned the data do. To execute processes on the files, we send a message to the appropriate indexer/worker and ask it to run the task and return a response back to us.

This is accomplished by defining "tasks" that can be run, and starting a worker process on each indexer to listen for messages and run the task. Celery is the framework used to facilitate these processes, and a message broker like RabbitMQ is deployed between the web server and workers to pass messages back and forth.

The Structure of the Worker

On each indexer/worker, a folder containing the tasks that can be executed and the configuration needed is located at /opt/diskover/diskover_celery.

The basic structure for /opt/diskover/ is described below.

/opt/diskover/ File or Directory Purpose
diskover_celery/ Root directory containing the Celery worker code, tasks, and configuration
etc/ Install files and miscellaneous items
common/ Libraries and modules used in different tasks
tasks/ Files containing all of the tasks that can be executed
celeryconfig.py The main configuration file
worker.py The main entry point for the worker process

The celeryconfig.py file needs to be configured with variables to connect to the RabbitMQ broker and also contains a section of imports to denote which task files should be registered and exposed.
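
A stripped-down celeryconfig.py could look roughly like the sketch below. The broker URL and the listed task modules are placeholders, and the file that ships with Diskover may use additional settings.

# celeryconfig.py -- illustrative sketch; values are placeholders.
# Connection to the RabbitMQ broker and a result backend.
broker_url = 'amqp://diskover:changeme@broker-host:5672//'
result_backend = 'rpc://'

# Task modules to register and expose on this worker; add your own
# tasks/<module>.py files here.
imports = (
    'tasks.example',
)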

Writing a Simple Task

Once registered, the worker listens for messages from the web server and routes each message to the task with the matching name. The task is passed arguments in the message that give it context on how to run. Let's take a look at a simple task function. You can find it in tasks/example.py.

@shared_task(bind=True, name='echo')
@json_exception
def echo(self, sources):
    logger.info(f'Calling echo task with: {sources}')
    return {'result': sources, 'error': None, 'traceback': None}

Each task should be defined by the @shared_task decorator and passed bind=True and a name equal to the task name you want to be registered. This name should be unique and is how the web server will invoke the task.

Note: The task's filename is not relevant when calling the task, only the name.

The @json_exception decorator is designed to catch any exceptions that occur inside the task and return a dictionary response with the error and traceback filled in. On successful execution, a dictionary is returned with the relevant data in the result value.

Tasks should always return a dictionary with the three key-value pairs defined so the calling application can interpret the results. The result field can contain anything you want but must be JSON serializable.
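
As an illustration of that contract, here is a hedged sketch of a second task that stats the selected paths on the worker and returns their sizes. It reuses the @shared_task and @json_exception decorators shown above and assumes each source dict carries a 'path' key as in the echo example; the import path for json_exception is an assumption.

# tasks/filesize.py -- illustrative sketch of a custom task following the
# result/error/traceback contract (assumes each source dict has a 'path' key).
import os
from celery import shared_task

# json_exception is the project decorator shown above; import path assumed.
from common.util import json_exception

@shared_task(bind=True, name='filesize')
@json_exception
def filesize(self, sources):
    sizes = {}
    for src in sources:
        path = src['path']
        # os.path.getsize may raise; @json_exception turns any exception
        # into an error/traceback response for the caller.
        sizes[path] = os.path.getsize(path)
    return {'result': sizes, 'error': None, 'traceback': None}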

Develop Your Own PHP File Action Plugins

    

Overview

This section covers the basics of how to create your own plugin. For example, you can add extra metadata to an index during crawl time by adding a plugin to the Diskover crawler. Other examples include database lookups to apply extra tags, content indexing that tags a file when a keyword is found, or copying/backing up files that match certain criteria. This is all done during crawl time.

Getting Started

Plugins are stored in the plugins/ folder in the root directory of Diskover. There are a few examples in the plugins folder to get you started. Plugins are run during a crawl.

🔴  Make a directory in plugins with the name of the plugin, example myplugin

🔴  Create a file in the myplugin directory named __init__.py

🔴  Copy the code from one of the example plugins and edit it to create your plugin. There are six required function names, but their bodies can be changed however you want as long as the return value type stays the same. The six required function names for plugins are (a minimal skeleton is sketched at the end of this section):

  • add_mappings
  • add_meta
  • add_tags
  • for_type
  • init
  • close

🔴  To enable or disable a plugin, edit the Diskover config file in the plugins section. For example, to enable the media info harvest plugin, add mediainfo to the files list in the plugins section.

🔴  To list all plugins that will be used during crawls:

python3 diskover.py -l
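
As referenced above, here is a minimal skeleton showing the six required names. The argument lists are assumptions modeled on the bundled example plugins (add_meta and add_tags mirror the alternate-scanner hooks shown later in this guide), so copy a plugin from plugins/ for the exact interface used by your version.

# plugins/myplugin/__init__.py -- skeleton only; the signatures below are
# placeholders, so copy a bundled example plugin for the real interface.

def init(diskover_globals):
    """Called once when the crawl starts (e.g. open connections)."""
    return

def close(diskover_globals):
    """Called once when the crawl finishes (e.g. close connections)."""
    return

def for_type(doc_type):
    """Return True if this plugin should run for the given doc type
    ('file' or 'directory')."""
    return doc_type in ('file', 'directory')

def add_mappings(mappings):
    """Add any extra Elasticsearch field mappings this plugin needs."""
    return mappings

def add_meta(path, osstat):
    """Return a dict of extra metadata fields to store for this path."""
    return None

def add_tags(metadict):
    """Return a list of extra tags based on the doc's metadata."""
    return None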

How to Create Diskover-Web File Action Plugins

This section covers the basics on how to create your own web plugins, known as File Actions. There are several examples in the public/fileactions/ directory in diskover-web. The examples all end with the extension .sample.

You will need some basic familiarity with PHP to create a File Action. A File Action can also call system processes, as shown in some of the examples, to run local and remote scripts/commands.

At the top of every File Action PHP file you will need:

// override debug output in fileactions include file
$fileactions_debug = FALSE;

include 'includes/fileactions.php';
include 'includes/fileactions_header.php';

And at the bottom:

include 'includes/fileactions_footer.php';

In the example file actions, you will see a foreach loop that iterates over the selected files/directories:

foreach ($fileinfo as $file) {
    ...
}

$fileinfo is an array of associative arrays, one per selected file/directory, each containing the ES index doc info (includes/fileactions.php):

$fileinfo[] = array(
    'docid' => $queryResponse['hits']['hits'][0]['_id'],
    'index' => $queryResponse['hits']['hits'][0]['_index'],
    'index_nocluster' => $mnclient->getIndexNoCluster($docindices_arr[$key]),
    'fullpath' => $queryResponse['hits']['hits'][0]['_source']['parent_path'] . '/' . $queryResponse['hits']['hits'][0]['_source']['name'],
    'source' => $queryResponse['hits']['hits'][0]['_source'],
    'type' => $queryResponse['hits']['hits'][0]['_source']['type']
);

So for example, to get the fullpath of the file, you would use $file['fullpath'], or to get the index name $file['index'], or to get the type (file or directory) $file['type'].

If you need to translate paths, you can do so with the built-in translate_path() function (includes/fileactions.php), which accepts two args.

$fullpath = $file['fullpath'];
$path_translations = array(
    '/^\//' => '/mnt/'
);
$fullpath = translate_path($fullpath, $path_translations);

To learn more about using and configuring web plugins, please refer to the File Actions section of the Configuration and Administration Guide.

Develop Your Own Alternate Scanner

Introduction to Writing an Alternate Scanner

Before talking about how to write an alternate scanner for Diskover, let's talk about what exactly a scanner is and define some terms. A scanner in Diskover is a piece of code that walks over some sort of hierarchical data set and gathers information about each node as it goes. It's often represented as a tree with a filesystem being the most common use case. In this case, the information it gathers would be the size of each file, when it was modified, the owner, etc.

Note however that a filesystem is only one use case. It could just as easily be a database that is specially designed for this kind of data, or more commonly the Amazon S3 buckets we frequently see. These are not standard filesystems, but they can still be scanned by Diskover with a little creativity and magic. This is where the alternate scanners come in.

Alternate scanners work like adapters for the main Diskover process, enabling communication with data sources that are foreign to it. They have all the parts needed to connect to the data source, list data on it, recursively walk through the data, and then return it in a format that Diskover understands, so it can display the metadata as well as use it for search, analytics, and workflow purposes.

One big advantage of an alternate scanner is that the data source might have additional data beyond the typical dataset we find on a filesystem. For example, Amazon S3 allows you to store user-defined metadata fields on each object in its store. With an S3 scanner, we could choose to pull that extra information and send it to Diskover for various uses.

Setup

In order to follow this guide, you'll need an FTP server running that has some data on it. This walkthrough uses the pyftpdlib library to serve up a home directory with a simple script.

🔴  Install the pyftpdlib package:

python3 -m pip install pyftpdlib

🔴  Put the code below in a file called ftp_server.py:

#!/usr/bin/env python3

from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

authorizer = DummyAuthorizer()
authorizer.add_user("user", "12345", "/home/sean", perm="elradfmwMT")

handler = FTPHandler
handler.authorizer = authorizer

server = FTPServer(("127.0.0.1", 21), handler)
server.serve_forever()

🔴  Execute it (the script must be run as root to be able to serve on the default port 21):

sudo python3 ftp_server.py

🔴  Once it starts up, you have a temporary FTP server listening on port 21. Don't forget to update the directory to share in the add_user() function.

The Plan

As an example, we are going to build an alternate scanner that scans an FTP site and displays the data in Diskover. As mentioned, to do this we need to be able to connect to the FTP site, stat files and directories on it, list directories, and walk the tree. Instead of working with the low-level ftplib Python library, we'll use a higher-level library called ftputil that will do a lot of the work for us.

🔴  Run the following to get it installed:

python3 -m pip install ftputil

🔴  To start, we need to think about implementing six functions that the main Diskover process will call in our alt scanner. These form the required interface for Diskover to communicate with our scanner:

  1. stat()
  2. scandir()
  3. walk()
  4. check_dirpath()
  5. abspath()
  6. get_storage_size()

Let's talk about them one by one.

FUNCTION DESCRIPTION
stat(path) stat() accepts a path and returns an object that resembles a stat_result object. It needs to have the following attributes: st_mode, st_ino, st_dev, st_nlink, st_uid, st_gid, st_size, st_sizedu, st_ctime, st_mtime, and st_atime. It's okay for some of these to be None, but they must all be present on the object.
scandir(path) scandir() accepts a path and returns a list of objects that resemble DirEntry objects. Each one needs to have the following attributes and methods: path, name, stat(), is_dir(), is_file(), is_symlink(), inode()
walk(path) walk() is a function (resembling os.walk()) that takes a path and recursively walks down the tree, returning a tuple of (path walked, dirs in the path, files in the path) for each directory found.
check_dirpath(path) check_dirpath() takes a root path and verifies that it actually exists.
abspath(path) abspath() converts a path given on the command line to an absolute path usable by Diskover.
get_storage_size(path) get_storage_size() attempts to figure out the total and free size of the storage we are working on.

Building the Scanner | The Foundation

Now that we've defined the six functions we need to implement, let's see how we can use the ftputil package to begin building an alternate scanner.

🔴  First, we will make an FTPServer class that will wrap the ftputil server connection and give it the six required functions:

import ftputil

class FTPServer:
    def __init__(self, host, user, password, storage_size):
        self.host = host
        self.user = user
        self.password = password
        self.storage_size = storage_size

        self.ftp_host = ftputil.FTPHost(host, user, password)

    def get_storage_size(self, path):
        pass

    def abspath(self, path):
        pass

    def check_dirpath(self, path):
        pass

    def scandir(self, top):
        pass

    def walk(self, top):
        pass

    def stat(self, path):
        pass

🔴  Now let's instantiate it with attributes needed to connect, and a storage size we determine offline since there is no way to determine that through the FTP protocol:

ftpserver = FTPServer(
    host='localhost',
    user='user',
    password='12345',
    storage_size=1.074e+11,
)

🔴  Before we even try running the scanner with Diskover, it's often helpful to run it as a standalone script and get the directory walking and scanning functions in place first. To do that, we can call the ftpserver.walk() function just as Diskover would call it and print out the results. First, let's start with the scandir() function, since it is called from the walk() function; scandir() should list the contents of a directory on the server and return them. Here's what it looks like so far:

import os
import ftputil

class FTPServer:
    def __init__(self, host, user, password, storage_size):
        self.host = host
        self.user = user
        self.password = password
        self.storage_size = storage_size

        self.ftp_host = ftputil.FTPHost(host, user, password)

    def get_storage_size(self, path):
        pass

    def abspath(self, path):
        pass

    def check_dirpath(self, path):
        pass

    def scandir(self, path):
        entries = self.ftp_host.listdir(path)

        for name in entries:
            yield os.path.join(path, name)

    def walk(self, top):
        pass

    def stat(self, path):
        pass

ftpserver = FTPServer(
    host='localhost',
    user='user',
    password='12345',
    storage_size=1.074e+11,
)

for f in ftpserver.scandir('/'):
    print(f)

🔴  Now we can get a listing of the top level of our FTP server by running the script, but as mentioned previously, scandir() needs to return objects that resemble DirEntry objects. Let's make a new class called FTPDirEntry that takes a path as well as our FTPServer object, and update the scandir() function to return FTPDirEntry objects:

class FTPDirEntry:
    def __init__(self, path, server):
        self._server = server
        self.path = path

    @property
    def name(self):
        return os.path.basename(self.path)

    def is_dir(self):
        return self._server.ftp_host.path.isdir(self.path)

    def is_file(self):
        return self._server.ftp_host.path.isfile(self.path)

    def is_symlink(self):
        return self._server.ftp_host.path.islink(self.path)

    def stat(self):
        return self._server.stat(self.path)

    def inode(self):
        return None

    def __repr__(self):
        return f'FTPDirEntry(type={"DIR" if self.is_dir() is True else "FILE"}, path={self.path})'

🔴  Luckily the ftputil package has some cool built-in functions that do exactly what we want, so we just map them directly onto our methods. Also, update the scandir() function:

    def scandir(self, path):
        entries = self.ftp_host.listdir(path)

        for name in entries:
            yield FTPDirEntry(os.path.join(path, name), self)

🔴  Now when we run the script, it will print out FTPDirEntry objects instead of strings and show us whether they are directories or files. Since we can now list a directory on the FTP server, let's write the code to walk the directories recursively. Below is the new code with a walk() method added; instead of starting it up by calling scandir(), we are calling walk() with '/' and printing out everything it returns:

import os
import ftputil

class FTPServer:
    def __init__(self, host, user, password, storage_size):
        self.host = host
        self.user = user
        self.password = password
        self.storage_size = storage_size

        self.ftp_host = ftputil.FTPHost(host, user, password)

    def get_storage_size(self, path):
        pass

    def abspath(self, path):
        pass

    def check_dirpath(self, path):
        pass

    def scandir(self, path):
        entries = self.ftp_host.listdir(path)

        for name in entries:
            yield FTPDirEntry(os.path.join(path, name), self)

    def walk(self, path):
        dirs, nondirs = list(), list()
        for entry in self.scandir(path):
            if entry.is_dir() is True:
                dirs.append(entry.name)
            else:
                nondirs.append(entry.name)
        yield path, dirs, nondirs

        # Recurse into sub-directories
        for name in dirs:
            new_path = os.path.join(path, name)
            yield from self.walk(new_path)

    def stat(self, path):
        pass


class FTPDirEntry:
    def __init__(self, path, server):
        self._server = server
        self.path = path

    @property
    def name(self):
        return os.path.basename(self.path)

    def is_dir(self):
        return self._server.ftp_host.path.isdir(self.path)

    def is_file(self):
        return self._server.ftp_host.path.isfile(self.path)

    def is_symlink(self):
        return self._server.ftp_host.path.islink(self.path)

    def stat(self):
        return self._server.stat(self.path)

    def inode(self):
        return None


    def __repr__(self):
        return f'FTPDirEntry(type={"DIR" if self.is_dir() is True else "FILE"}, path={self.path})'


ftpserver = FTPServer(
    host='localhost',
    user='user',
    password='12345',
    storage_size=1.074e+11,
)

for root, dirs, files in ftpserver.walk('/'):
    print(f'Walking {root}')
    for d in dirs:
        print(f'DIR: {d}')
    for f in files:
        print(f'FILE: {f}')

Note: We now have some basic functionality that will walk our FTP server, but it's still missing quite a few details needed to work with Diskover. Let's fill those in, as described in the next section.

Building the Scanner | The Final Steps

🔴  The first thing that Diskover will try to do with our scanner is run check_dirpath() on the root path we pass in on the command line, to make sure it's a valid path on the FTP server. We can check whether a path exists on the FTP server by trying to list the remote directory. So let's have check_dirpath() call ftp_host.listdir() and try it with '/', which should always work (since it's the root), and '/FOO', which doesn't exist. It should return a tuple of (bool, error message):

    def check_dirpath(self, path):
        self.ftp_host.listdir(path)
        return True, None

🔴  Then call it with:

print(ftpserver.check_dirpath('/'))
print(ftpserver.check_dirpath('/FOO'))

🔴  After running it, it looks like the first one succeeded and the second threw an ftputil.error.PermanentError: 550 when we tried to list a non-existent directory. Now that we know that error, we can modify check_dirpath() to handle it and return False along with an error message, instead of letting it kill the program. This means that if a user calls the scanner with a bad path on the command line, they will get a friendly error telling them that the folder does not exist:

    def check_dirpath(self, path):
        try:
            self.ftp_host.listdir(path)
        except ftputil.error.PermanentError as e:
            if e.errno == 550:
                return False, f'No such directory! {repr(path)}'
            else:
                raise
        return True, None

🔴  Next, Diskover is going to call our scanner's abspath() function. Luckily, ftputil has another built-in function for this, so that's an easy one. You will often also want to sanitize the input given by the user here, for example by removing trailing slashes:

    def abspath(self, path):
        if path != '/':
            path = path.rstrip('/')
        return self.ftp_host.path.abspath(path)

🔴  For get_storage_size(), it should return a tuple of the total size of the storage, the free size, and the available size. Since this runs at the start of the crawl, and we have no access to the actual filesystem, we have no way to calculate these numbers. We only have the total size because we manually entered it into the config beforehand. So we will just return the total size three times, which means the usage figures shown for this storage in Diskover will not be meaningful:

    def get_storage_size(self, path):
        return self.storage_size, self.storage_size, self.storage_size

🔴  Finally, let's update the stat() function with a simple built-in function from ftputil. Then let's take out our invocation code at the bottom and add the code needed to get it to run as an alternate scanner. When Diskover imports an alt scanner, it first looks for those six required functions at the top level of our code, but since we have them defined as methods on an instance, we need to assign those methods to the names it expects at the top level:

walk = ftpserver.walk
scandir = ftpserver.scandir
check_dirpath = ftpserver.check_dirpath
abspath = ftpserver.abspath
get_storage_size = ftpserver.get_storage_size
stat = ftpserver.stat

🔴  Here's what it should look like:

import os
import ftputil

class FTPServer:
    def __init__(self, host, user, password, storage_size):
        self.host = host
        self.user = user
        self.password = password
        self.storage_size = storage_size

        self.ftp_host = ftputil.FTPHost(host, user, password)

    def get_storage_size(self, path):
        return self.storage_size, self.storage_size, self.storage_size

    def abspath(self, path):
        if path != '/':
            path = path.rstrip('/')
        return self.ftp_host.path.abspath(path)

    def check_dirpath(self, path):
        try:
            self.ftp_host.listdir(path)
        except ftputil.error.PermanentError as e:
            if e.errno == 550:
                return False, f'No such directory! {repr(path)}'
            else:
                raise
        return True, None

    def scandir(self, path):
        entries = self.ftp_host.listdir(path)

        for name in entries:
            yield FTPDirEntry(os.path.join(path, name), self)

    def walk(self, path):
        dirs, nondirs = list(), list()
        for entry in self.scandir(path):
            if entry.is_dir() is True:
                dirs.append(entry.name)
            else:
                nondirs.append(entry.name)
        yield path, dirs, nondirs

        # Recurse into sub-directories
        for name in dirs:
            new_path = os.path.join(path, name)
            yield from self.walk(new_path)

    def stat(self, path):
        return self.ftp_host.stat(path)


class FTPDirEntry:
    def __init__(self, path, server):
        self._server = server
        self.path = path

    @property
    def name(self):
        return os.path.basename(self.path)

    def is_dir(self):
        return self._server.ftp_host.path.isdir(self.path)

    def is_file(self):
        return self._server.ftp_host.path.isfile(self.path)

    def is_symlink(self):
        return self._server.ftp_host.path.islink(self.path)

    def stat(self):
        return self._server.stat(self.path)

    def inode(self):
        return None


    def __repr__(self):
        return f'FTPDirEntry(type={"DIR" if self.is_dir() is True else "FILE"}, path={self.path})'


ftpserver = FTPServer(
    host='localhost',
    user='user',
    password='12345',
    storage_size=1.074e+11,
)

walk = ftpserver.walk
scandir = ftpserver.scandir
check_dirpath = ftpserver.check_dirpath
abspath = ftpserver.abspath
get_storage_size = ftpserver.get_storage_size
stat = ftpserver.stat

🔴  We now have something that can begin to be run by Diskover, and we can figure out the missing pieces from there. Make sure you have the file in /opt/diskover/scanners/ftp_scanner.py and then:

cd /opt/diskover

🔴  You can try executing it with:

python3 diskover.py --altscanner=ftp_scanner /

🔴  When we run this command, it runs for a little while and then throws an error when it calls our stat() function on the root directory: ftputil.error.RootDirError: can't stat remote root directory

It looks like a limitation of the ftputil package, so we will have to handle the error in stat() and return default information in this case. We can use a SimpleNamespace object from the built-in types library to mimic a stat_result object with default info set:

Note: Don't forget to import SimpleNamespace: from types import SimpleNamespace.

    def stat(self, path):
        try:
            return self.ftp_host.stat(path)
        except ftputil.error.RootDirError:
            return SimpleNamespace(**{
                'st_mode': 0,
                'st_ino': 0,
                'st_dev': None,
                'st_nlink': 1,
                'st_uid': 0,
                'st_gid': 0,
                'st_size': 1,
                'st_sizedu': 1,
                'st_ctime': 0,
                'st_mtime': 0,
                'st_atime': 0,
          })

This runs even farther now, but it looks like the StatResult object returned by ftputil doesn't have at least one of the attributes that Diskover is expecting: st_sizedu. Here is what an ftputil StatResult object looks like:

StatResult(st_mode=33204, st_ino=None, st_dev=None, st_nlink=1, st_uid='sean', st_gid='sean', st_size=20811, st_atime=None, st_mtime=1670964480.0, st_ctime=None)

🔴  We are going to have an issue with those None values for the times as well, so let's just return another SimpleNamespace object, copying the attributes we need from the StatResult and transforming them where needed. We will assume st_mtime is always set, and if st_atime or st_ctime is None, set them to st_mtime:

    def stat(self, path):
        try:
            st_res = self.ftp_host.stat(path)
            return SimpleNamespace(**{
                'st_mode': st_res.st_mode,
                'st_ino': st_res.st_ino,
                'st_dev': st_res.st_dev,
                'st_nlink': st_res.st_nlink,
                'st_uid': st_res.st_uid,
                'st_gid': st_res.st_gid,
                'st_size': st_res.st_size,
                'st_sizedu': st_res.st_size,
                'st_ctime': st_res.st_ctime if st_res.st_ctime is not None else st_res.st_mtime,
                'st_mtime': st_res.st_mtime,
                'st_atime': st_res.st_atime if st_res.st_atime is not None else st_res.st_mtime,
          })
        except ftputil.error.RootDirError:
            return SimpleNamespace(**{
                'st_mode': 0,
                'st_ino': 0,
                'st_dev': None,
                'st_nlink': 1,
                'st_uid': 0,
                'st_gid': 0,
                'st_size': 1,
                'st_sizedu': 1,
                'st_ctime': 0,
                'st_mtime': 0,
                'st_atime': 0,
          })

🔴  Finally, let's add two top-level functions that Diskover requires that we haven't mentioned yet. These are functions that allow you to add extra metadata and tags to each file as a scan is running. For now, we are just going to leave them empty:

def add_meta(path, stat):
    return None

def add_tags(metadict):
    return None

🔴  Now we have what should be a complete scanner, but when we run it with the default Diskover config, we run into problems. It hangs with a repeated message like OSError 250 / is the current directory. Our guess is that either the ftputil package or the FTP server itself doesn't handle threads and multiple connections well, so it gets confused about where we are and what we want at any given time.

The main Diskover process is multi-threaded so it's throwing lots of requests for different things to the server at one time. To test this out, we can make the Diskover process single-threaded by changing two config settings in ~/.config/diskover/config.yaml.

Edit that file and set both maxthreads: 1 and maxwalkthreads: 1 then try running the scanner again. When we change those settings and run the scanner again...it works!

🔴  While this works, it would be better to use multiple threads and add locking to the sections of code that need it, so we get a performance boost. Basically, we want to take a threading.Lock any time we perform an ftp_host operation. We'll just add that code and post the whole thing below. We're also adding the optional close() function so that our ftp_host is gracefully closed at the end of the operation.

import os
from types import SimpleNamespace
from threading import Lock

import ftputil


class FTPServer:
    def __init__(self, host, user, password, storage_size):
        self.host = host
        self.user = user
        self.password = password
        self.storage_size = storage_size

        self.ftp_host = ftputil.FTPHost(host, user, password)
        self.lock = Lock()

    def get_storage_size(self, path):
        return self.storage_size, self.storage_size, self.storage_size

    def abspath(self, path):
        if path != '/':
            path = path.rstrip('/')
        with self.lock:
            return self.ftp_host.path.abspath(path)

    def check_dirpath(self, path):
        try:
            with self.lock:
                self.ftp_host.listdir(path)
        except ftputil.error.PermanentError as e:
            if e.errno == 550:
                return False, f'No such directory! {repr(path)}'
            else:
                raise
        return True, None

    def scandir(self, path):
        with self.lock:
            entries = self.ftp_host.listdir(path)

        for name in entries:
            yield FTPDirEntry(os.path.join(path, name), self)

    def walk(self, path):
        dirs, nondirs = list(), list()
        for entry in self.scandir(path):
            if entry.is_dir() is True:
                dirs.append(entry.name)
            else:
                nondirs.append(entry.name)
        yield path, dirs, nondirs

        # Recurse into sub-directories
        for name in dirs:
            new_path = os.path.join(path, name)
            yield from self.walk(new_path)

    def stat(self, path):
        try:
            with self.lock:
                st_res = self.ftp_host.lstat(path)
            return SimpleNamespace(**{
                'st_mode': st_res.st_mode,
                'st_ino': st_res.st_ino,
                'st_dev': st_res.st_dev,
                'st_nlink': st_res.st_nlink,
                'st_uid': st_res.st_uid,
                'st_gid': st_res.st_gid,
                'st_size': st_res.st_size,
                'st_sizedu': st_res.st_size,
                'st_ctime': st_res.st_ctime if st_res.st_ctime is not None else st_res.st_mtime,
                'st_mtime': st_res.st_mtime,
                'st_atime': st_res.st_atime if st_res.st_atime is not None else st_res.st_mtime,
          })
        except ftputil.error.RootDirError:
            return SimpleNamespace(**{
                'st_mode': 0,
                'st_ino': 0,
                'st_dev': None,
                'st_nlink': 1,
                'st_uid': 0,
                'st_gid': 0,
                'st_size': 1,
                'st_sizedu': 1,
                'st_ctime': 0,
                'st_mtime': 0,
                'st_atime': 0,
          })

    def close(self, _):
        self.ftp_host.close()


class FTPDirEntry:
    def __init__(self, path, server):
        self._server = server
        self.path = path

    @property
    def name(self):
        return os.path.basename(self.path)

    def is_dir(self):
        with self._server.lock:
            return self._server.ftp_host.path.isdir(self.path)

    def is_file(self):
        with self._server.lock:
            return self._server.ftp_host.path.isfile(self.path)

    def is_symlink(self):
        with self._server.lock:
            return self._server.ftp_host.path.islink(self.path)

    def stat(self):
        return self._server.stat(self.path)

    def inode(self):
        return None

    def __repr__(self):
        return f'FTPDirEntry(type={"DIR" if self.is_dir() is True else "FILE"}, path={self.path})'


def add_meta(path, stat):
    return None

def add_tags(metadict):
    return None


ftpserver = FTPServer(
    host='localhost',
    user='user',
    password='12345',
    storage_size=1.074e+11,
)

walk = ftpserver.walk
scandir = ftpserver.scandir
check_dirpath = ftpserver.check_dirpath
abspath = ftpserver.abspath
get_storage_size = ftpserver.get_storage_size
stat = ftpserver.stat
close = ftpserver.close

Conclusion

So there you have it, a working FTP alt scanner for Diskover! There is still more that could be done; for example, we should probably:

  • Move the FTP connection values into a config file.
  • Add support for servers that use TLS.
  • Add special metadata to each file as it's scanning.
  • Add tags to each file as it's scanning (a sketch of these last two items follows the conclusion below).

Even so, you now have an understanding of all the parts necessary to build your own alternate scanners. Good luck!
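
For the last two bullets above, a hedged sketch of what filling in add_meta() and add_tags() in the same ftp_scanner.py module could look like is shown below. The return shapes (a dict of extra fields, a list of tag names) are assumptions based on the bundled alt scanners, so verify them against the scanners shipped in /opt/diskover/scanners/.

# Illustrative only: possible non-empty versions of the two hooks we left
# blank earlier, added to the same ftp_scanner.py module. The expected
# return shapes are assumptions.

def add_meta(path, stat):
    # Store the FTP host alongside each document as an extra metadata field.
    return {'ftp_host': ftpserver.host}

def add_tags(metadict):
    # Tag anything that looks like a log file.
    if metadict.get('name', '').endswith('.log'):
        return ['logfile']
    return None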

Using the Diskover-Web API

      

Overview

Diskover-Web has a REST API for creating, getting, updating, and deleting index and task data.


GET (with curl or web browser)

Getting file/directory tag info is done with the GET method.

For "tags" and "search" endpoints, you can set the page number and result size with ex. &page=1 and &size=100. Default is page 1 and size 1000.

Curl example:

curl -X GET http://localhost:8000/api.php/indexname/endpoint

List all Diskover indices and stats for each:

GET http://localhost:8000/api.php/list

List all files with no tag (untagged):

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=file

List all directories with no tag (untagged):

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=&type=directory

List files with tag "version 1":

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=file

List directories with tag "version 1":

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201&type=directory

List files/directories (all items) with tag "version 1":

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=version%201

List files with tag "archive":

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=archive&type=file

List directories with tag "delete":

GET http://localhost:8000/api.php/diskover-2018.01.17/tags?tag=delete&type=directory

List total size (in bytes) of files for each tag:

GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?type=file

List total size (in bytes) of files with tag "delete":

GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=delete&type=file

List total size (in bytes) of files with tag "version 1":

GET http://localhost:8000/api.php/diskover-2018.01.17/tagsize?tag=version%201&type=file

List total number of files for each tag:

GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?type=file

List total number of files with tag "delete":

GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=delete&type=file

List total number of files with tag "version 1":

GET http://localhost:8000/api.php/diskover-2018.01.17/tagcount?tag=version+1&type=file

Search index using ES query syntax:

GET http://localhost:8000/api.php/diskover-2018.01.17/search?query=extension:png%20AND%20type:file%20AND%20size:>1048576

Get latest completed index using top path in index:

GET http://localhost:8000/api.php/latest?toppath=/dirpath

Get disk space info for all top paths in an index:

GET http://localhost:8000/api.php/diskover-2018.01.17/diskspace

Update (with JSON object)

Updating file/directory tags and tasks is done with the PUT method. You can send a JSON object in the body. The call returns the status and number of items updated.

Curl example:

curl -X PUT http://localhost:8000/api.php/index/endpoint -d '{}'

Tag files "delete":

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}

Tag files with tags "archive" and "version 1":

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["archive", "version 1"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}

Remove tag "delete" for files which are tagged "delete":

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": ["delete"], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}

Remove all tags for files:

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagfiles
{"tags": [], "files": ["/Users/shirosai/file1.png", "/Users/shirosai/file2.png"]}

Tag directory "archive" (non-recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}

Tag directories and all files in directories with tags "archive" and "version 1" (non-recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive", "version 1"], "dirs": ["/Users/shirosai/Downloads", "/Users/shirosai/Documents"], "tagfiles": "true"}

Tag directory and all sub-dirs (no files) with tag "version 1" (recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true"}

Tag directory and all items (files/directories) in directory and all sub-dirs with tag "version 1" (recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["version 1"], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}

Remove tag "archive" from directory which is tagged "archive":

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": ["archive"], "dirs": ["/Users/shirosai/Downloads"]}

Remove all tags from directory and all files in directory (non-recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "tagfiles": "true"}

Remove all tags from directory and all items (files/directories) in directory and all sub dirs (recursive):

PUT http://localhost:8000/api.php/diskover-2018.01.17/tagdirs
{"tags": [], "dirs": ["/Users/shirosai/Downloads"], "recursive": "true", "tagfiles": "true"}

Update task as disabled (or enabled):

PUT http://localhost:8000/api.php/diskover-2018.01.17/updatetask
{"id": "4eba40842e2248b1fb3b1f6631bef7e8", "disabled": true}

Create (with JSON object)

Creating tasks is done with the POST method. You can send a JSON object in the body. The call returns the item that was created.

Curl example:

curl -X POST http://localhost:8000/api.php/endpoint -d '{}'

Create an index task:

POST http://localhost:8000/api.php/diskover-2018.01.17/addtask
{"type": "index", "name": "FOO", "crawl_paths": "/foo", "retries": 1, "timeout": 30, "retry_delay": 1, "run_min": 0, "run_hour": ,"run_month": "*", "run_day_month": "*", "run_day_week": "*"}

Delete (with JSON object)

Deleting tasks is done with the DELETE method. You can send a JSON object in the body. The call returns the status.

Curl example:

curl -X DELETE http://localhost:8000/api.php/endpoint -d '{}'

Delete a task:

DELETE http://localhost:8000/api.php/diskover-2018.01.17/deletetask
{"id": "4eba40842e2248b1fb3b1f6631bef7e8"}

Examples of API calls in Python

"""example usage of diskover-web rest-api using requests and urllib
"""
import requests
try:
    from urllib import quote
except ImportError:
    from urllib.parse import quote
import json

url = "http://localhost:8000/api.php"

List all diskover indices:

r = requests.get('%s/list' % url)
print(r.url + "\n")
print(r.text + "\n")

List total number of files for each tag in diskover-index index:

index = "diskover-index"
r = requests.get('%s/%s/tagcount?type=file' % (url, index))
print(r.url + "\n")
print(r.text + "\n")

List all png files larger than 1 MiB in the diskover-index index:

q = quote("extension:png AND _type:file AND filesize:>1048576")
r = requests.get('%s/%s/search?query=%s' % (url, index, q))
print(r.url + "\n")
print(r.text + "\n")

Tag directory and all files in directory with tag "archive" (non-recursive):

d = {'tag': 'archive', 'path_parent': '/Users/cp/Downloads', 'tagfiles': 'true'}
r = requests.put('%s/%s/tagdir' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")

Create a custom task:

d = {'type': 'Custom', 'name': 'FOOBAR', 'run_command': 'python3',
     'runcommand_args': '/home/foo/myscript.py', 'run_min': 0, 'run_hour': 1,
     'run_month': '*', 'run_day_month': '*', 'run_day_week': '*', 'retries': 1,
     'timeout': 30, 'retry_delay': 1, 'description': 'YAHOO!'
}
r = requests.post('%s/%s/addtask' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")

Update a task to enabled and have it run now:

d = {'id': '4eba40842e2248b1fb3b1f6631bef7e8', 'disabled': False, 'run_now': True}
r = requests.put('%s/%s/updatetask' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")

Delete a task:

d = {'id': '4eba40842e2248b1fb3b1f6631bef7e8'}
r = requests.delete('%s/%s/deletetask' % (url, index), data = json.dumps(d))
print(r.url + "\n")
print(r.text + "\n")

Support

Support Options

Support & Resources Free Community Edition Annual Subscription*
Online Documentation
Slack Community Support
Diskover Community Forum
Knowledge Base
Technical Support
Phone Support
  • (800) 560-5853
  • Monday to Friday | 8am to 6pm PST
Remote Training

*               

Feedback

We'd love to hear from you! Email us at info@diskoverdata.com

Warranty & Liability Information

Please refer to our Diskover End-User License Agreements for the latest warranty and liability disclosures.

Contact Diskover

Method Coordinates
Website https://diskoverdata.com
General Inquiries info@diskoverdata.com
Sales sales@diskoverdata.com
Demo request demo@diskoverdata.com
Licensing licenses@diskoverdata.com
Support Open a support ticket with Zendesk
800-560-5853 | Mon-Fri 8am-6pm PST
Slack Join the Diskover Slack Workspace
GitHub Visit us on GitHub
AJA Media Edition 530-271-3190
sales@aja.com
support@aja.com

© Diskover Data, Inc. All rights reserved. All information in this manual is subject to change without notice. No part of the document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopying or recording, without the express written permission of Diskover Data, Inc.