IPoid (Wikitech: [IPoid](https://wikitech.wikimedia.org/wiki/IPoid), Gerrit: [mediawiki/services/ipoid](https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/ipoid,general)) is a Node.js-based service. It offers two basic functionalities:
IPoid calls Spur to fetch a large (~700MB) gzipped JSON feed. That data is then processed and stored in IPoid's MariaDB database. The gzipped data unzips to ~4GB, with a row count in the order of millions.
**Regular data updates**
TODO: IPoid should update its database daily.
This service is built on the service-runner template ([service-template-node](https://github.com/wikimedia/service-template-node)).
IPoid's REST API is accessed via the MediaWiki extension [SecurityApi](https://www.mediawiki.org/wiki/Extension:SecurityApi): IPoid receives requests for this data from MediaWiki and provides access through its RESTful API.
```
npm start
```
This starts an HTTP server listening on `localhost:6927`. There are several
routes you may query (with a browser, or `curl` and friends):
* `http://localhost:6927/_info/`
* `http://localhost:6927/_info/name`
* `http://localhost:6927/_info/version`
* `http://localhost:6927/_info/home`
* `http://localhost:6927/{domain}/v1/siteinfo{/prop}`
* `http://localhost:6927/{domain}/v1/page/{title}`
* `http://localhost:6927/{domain}/v1/page/{title}/lead`
* `http://localhost:6927/ex/err/array`
* `http://localhost:6927/ex/err/file`
* `http://localhost:6927/ex/err/manual/error`
* `http://localhost:6927/ex/err/manual/deny`
* `http://localhost:6927/ex/err/auth`
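For example, a quick smoke test against the service info routes (the response shape follows service-template-node conventions, so treat this as a sketch):
```
# Returns basic service metadata as JSON
curl http://localhost:6927/_info/

# Query a single field of the metadata
curl http://localhost:6927/_info/version
```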
### With Docker
```
docker-compose up --detach
```
Two containers should start:
```
Creating network "ipoid_default" with the default driver
Creating ipoid_db_1 ... done
Creating ipoid_web_1 ... done
```
#### Configs
Edits to configs for local use should be made in `docker-compose.override.yml`. Most configs are good out of the box but you'll need to update the following:
|Config|Value|
|---|---|
|`SPUR_API_KEY`|Your Spur feeds API key|
## Getting Started
### Getting Data
**Data from Spur (manual)**
Feed files can be retrieved from Spur using an API key (keep this safe!) for local use. These files should ideally be placed in the `$DATADIR` folder. Here are some examples:
Call a list of available files:
```
curl -H "Token: $TOKEN" "https://feeds.spur.us/v2/$FEED_TYPE/"
```
Get the latest feed file and save it to latest.json.gz:
```
curl -o latest.json.gz -L -H "Token: $TOKEN" "https://feeds.spur.us/v2/$FEED_TYPE/latest.json.gz"
```
Get a feed file from a certain date and save it to a gzipped JSON:
```
curl -L -H "Token: $TOKEN" -o 20230719.feed.json.gz "https://feeds.spur.us/v2/$FEED_TYPE/20230719/feed.json.gz"
```
**Data from Spur (script)**
IPoid uses `get-feed.js` to programmatically retrieve updates, and this script can also be called manually. To use it, an API key needs to be set up in the `docker-compose.override.yml` file. See `docker-compose.override.yml.example` for the syntax.
`get-feed.js` takes 1 argument, the date in `yyyymmdd` format, and uses it to download that day's feed to `$DATADIR/$DATE.json.gz`.
node -e "require('./get-feed.js').init(date);"
|Parameter|Input|Optional|
|---|---|---|
|`date`|The date of the feed to download, in `yyyymmdd` format|No|
Note that in production, `get-feed` will require a proxy to make requests to spur.us. To use a proxy, set the `HTTPS_PROXY` environment variable, for example `HTTPS_PROXY: http://url-downloader.eqiad.wikimedia.org:8080`.
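A minimal sketch of a manual run, assuming the API key is already configured (the date is a placeholder, passed here as a string; adjust if the script expects a different type):
```
# Download the 2023-07-19 feed to $DATADIR/20230719.json.gz
node -e "require('./get-feed.js').init('20230719');"

# The same call routed through the production proxy
HTTPS_PROXY=http://url-downloader.eqiad.wikimedia.org:8080 \
  node -e "require('./get-feed.js').init('20230719');"
```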
IPoid uses a series of discrete scripts that should be chained together into a full pipeline (feed => database). The full pipeline can be run with `./main.sh`, which can be used to do an initial import of data or update existing data. It tries to be flexible in the parameters it accepts but rigid in how it expects to use them.
Doing an initial import (See 'Initialization scripts' for database setup scripts):
```
./main.sh --init true
OR
./main.sh --init true --today {YYYYMMDD}
```
Doing an update:
```
./main.sh --today {YYYYMMDD} --yesterday {YYYYMMDD}
OR
./main.sh --today {YYYYMMDD}
OR
./main.sh --yesterday {YYYYMMDD}
OR
./main.sh
```
For initial imports, if `--today` isn't passed as a parameter, `main.sh` will use the server's date to calculate `today`. For updates, if `main.sh` isn't passed both a `--today` and a `--yesterday` parameter, it uses bash's `date` utility to guess the missing date.
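As an illustration only (the exact logic lives in `main.sh` and may differ), the missing date can be derived with GNU `date` like this:
```
# Requires GNU date
today=$(date +%Y%m%d)                          # e.g. 20230719
yesterday=$(date -d "$today - 1 day" +%Y%m%d)  # e.g. 20230718
```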
`--debug true` can be passed to keep all intermediary files generated by the scripts. This is useful for debugging or updating tests. Set to any other value to delete intermediary files.
`--batchsize` takes an integer and defines the number of rows per batch. This defaults to 10000 and non-default values should only be used for development/testing, as there's no way for the database to know that the batch count is a non-default value.
|Parameter|Input|Optional|
|---|---|---|
|`init`|Declare if this is an initial import or not (only accepts `true` as a value)|Yes|
|`yesterday`|Yesterday's date in `YYYYMMDD` format|Yes|
|`today`|Today's date in `YYYYMMDD` format|Yes|
|`debug`|Keep intermediary files (if `true`)|No|
|`batchsize`|Number of rows per batch, defaults to 10000|Yes|
Note that passing `--init true` will also result in `init-db.js` being called with `shouldInit` set to `true`, which means that the script will drop and recreate the `ipoid` database.
Usage: `node -e "require('./create-users.js')();"`
Production uses a read/write user (`ipoid_rw`) and a read-only user (`ipoid_ro`). The MariaDB configuration in `docker-compose.yml` will only take one user parameter and makes it the root user. We keep that user because it's useful for debugging in development. To mimic production, this script creates the two users in the development environment and gives them the appropriate permissions. These users are expected to be used in subsequent db operations from the web container.
**`init-db.js`**
Usage: `node -e "require('./init-db.js').init(shouldInit, updateName);"`
Takes 2 arguments, `shouldInit` and `updateName`. This script will initialize a database from scratch or apply updates to an existing database. `shouldInit` is a boolean that determines whether or not the script should drop the existing database and recreate it from scratch. `updateName` is a string that refers to the `name` of an update as declared in `./schema/updates.json`; if it's passed, the script will only apply that update. If neither `shouldInit` nor `updateName` is passed, the script attempts to apply all updates in the order they're declared, ignoring any that have already been run as recorded in the `update_log` table. If `shouldInit` is `true`, it takes priority and `updateName` is ignored.
|Parameter|Input|Optional|
|---|---|---|
|`shouldInit`|Boolean declaring whether or not to create a database from scratch|No|
|`updateName`|String that refers to the `name` of an update that should be run|No|
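Some example invocations based on the signature above (the update name is a placeholder; real names live in `./schema/updates.json`):
```
# Drop and recreate the database from scratch
node -e "require('./init-db.js').init(true);"

# Apply all updates not yet recorded in update_log
node -e "require('./init-db.js').init();"

# Apply a single named update (placeholder name)
node -e "require('./init-db.js').init(false, 'example-update');"
```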
#### Diffing Scripts
**`diff.sh`**
Usage: `./diff.sh --yesterday $PATH_TO_YESTERDAY_GZIP --today $PATH_TO_TODAY_GZIP`
|Parameter|Input|Optional|
|---|---|---|
|`yesterday`|Path to the gzipped file (preferably in `$DATADIR`) for yesterday's data|Yes|
|`today`|Path to the gzipped file (preferably in `$DATADIR`) for today's data; if no file for yesterday is passed, it is treated as the initial dataset|No|
|`debug`|Keep intermediary files (if `true`)|No|
Takes 2 arguments, yesterday's feed and today's feed. Using those, it calculates the difference between them and outputs an SQL file (`$DATADIR/statement.sql`) containing every update that has to be made to the db. Under the hood, this runs `output-sql.js`.
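For example, with two hypothetical daily feed files already downloaded into `$DATADIR`:
```
# Produces $DATADIR/statement.sql from the changes between the two days
./diff.sh --yesterday $DATADIR/20230718.json.gz --today $DATADIR/20230719.json.gz
```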
Usage: `node -e "require('./output-sql.js')(filePath, mode);"`
|Parameter|Input|Optional|
|---|---|---|
|`filePath` (1st arg)|Path to either a JSON (initial data import) or a sorted newline-delineated list of updates|No|
|`mode` (2nd arg)|Mode, either `import` or `diff`. Defaults to `diff` if undefined or unrecognized.|Yes|
This gets run by `diff.sh` to generate the sql file based on the sorted results of `yesterday_today.unique.sorted`, a hardcoded file generated by `diff.sh`, _or_ a JSON representing the initial dataset to import. This script is meant to be used internally but can be run independently and doing so can be useful for debugging.
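A hypothetical standalone run for debugging, in import mode against a feed JSON (the path is a placeholder):
```
node -e "require('./output-sql.js')('$DATADIR/20230719.json', 'import');"
```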
**`get-properties.js`**
Usage: `node -e "require('./get-properties.js')(filePath);"`
|Parameter|Input|Optional|
|---|---|---|
|`filePath`|Path to either a JSON (initial data import) or a sorted newline-delineated list of updates|No|
Outputs a JSON file that describes the properties (behaviors, proxies, and risks) to be imported into the db. The output file serves as an input file to `import-properties.js`.
**`import-properties.js`**
Usage: `node -e "require('./import-properties.js')(filePath, debugEnabled);"`
|Parameter|Input|Optional|
|---|---|---|
|`filePath`|Path to JSON file to be imported into the db|No|
|`debugEnabled`|Keep intermediary files (only accepts `true` as a value)|Yes|
Takes as its main argument a JSON file that describes the properties (behaviors, proxies, and risks) to be imported into the db. This must be run before any imports from the feed can be run, because the properties must exist before they can be associated with the actors.
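A sketch of how the two property scripts chain together (both paths are placeholders; check what `get-properties.js` actually writes before relying on them):
```
# Extract properties from a feed/update file into a JSON file
node -e "require('./get-properties.js')('$DATADIR/20230719.json');"

# Load the resulting JSON into the db (output path is a placeholder)
node -e "require('./import-properties.js')('$DATADIR/properties.json');"
```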
**`import.sh`**
Usage: `./import.sh $SLEEP_BETWEEN_BATCHES`
|Parameter|Input|Optional|
|---|---|---|
|`$SLEEP_BETWEEN_BATCHES`|Number of seconds to sleep between batches|Yes|
|`debug`|Keep intermediary files (if `true`)|No|
Takes an optional argument, an integer representing the number of seconds to sleep between batches. It looks for the hardcoded file, `$DATADIR/statements.sql`, splits it into batches, and then runs `update-db.js` on each file, sleeping as necessary in between.
The number of lines included in each batch is controlled by the `--batchsize` parameter if it was passed, otherwise by the `$BATCH_SIZE` environment variable, with a final fallback to the default value of 10,000.
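For example, to run the import with a two-second pause between batches and a smaller batch size (non-default batch sizes are for development/testing only):
```
BATCH_SIZE=5000 ./import.sh 2
```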
**`update-db.js`**
Usage: `node -e "require('./update-db.js')(filePath);"`
|Parameter|Input|Optional|
|---|---|---|
|(1st arg)|Path to sql file to be imported into the db|No|
Takes 1 argument, the path to an SQL file, and imports it into the db. This is meant to be used internally by `import.sh` but can be run independently.
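A hypothetical standalone run against the SQL file that `import.sh` looks for:
```
node -e "require('./update-db.js')('$DATADIR/statements.sql');"
```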
To run basic unit tests, run:
```
npm test
```
If you haven't changed anything in the code (and you have a working Internet
connection), you should see all the tests passing.
Occasionally new test cases have to be added to the test suite to help guard against regressions. To add a new test case to the fake data, update the data files (`./test/data/20000101_fake.json` and/or `./test/data/20000102_fake.json`) and run `./generate-test-files.sh`. This updates the intermediary files used by various tests. Developers should review the diffs of the new files to confirm that the tested behaviour is correct.
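A typical workflow for this might look like the following (the data edits themselves are whatever your new case needs):
```
# 1. Edit the fake input data in ./test/data/

# 2. Regenerate the intermediary files used by the tests
./generate-test-files.sh

# 3. Review the regenerated files before committing
git diff
```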
## Docker
The `docker-start` and `docker-test` scripts are deprecated, and only remain for backwards compatibility. Instead, developers should configure `.pipeline/blubber.yaml` and install [Blubber](https://github.com/wikimedia/blubber) to generate the desired Dockerfile.
To see the Dockerfile generated by Blubber, ensure the Blubber CLI is set up and execute:
```
blubber .pipeline/blubber.yaml {variant}
```
where `{variant}` is one of the variants defined in `blubber.yaml`, e.g. `build`, `development`, or `test`.
In place of `docker-test`, to run your service's tests, execute:
```
blubber .pipeline/blubber.yaml test | docker build --tag service-test --file - .
```
```
docker run service-test
```
In place of `docker-start`, to run your service, execute:
```
blubber .pipeline/blubber.yaml production | docker build --tag service-node --file - .
```
```
docker run service-node
```
## Troubleshooting
In many cases, when there is an issue with Node, it helps to recreate the
`node_modules` directory:
```
rm -r node_modules
npm install
```