https://phabricator.wikimedia.org/T325147

IPoid (Wikitech: [IPoid](https://wikitech.wikimedia.org/wiki/IPoid), Gerrit: [mediawiki/services/ipoid](https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/ipoid,general)) is a Node.js-based service. It offers two basic functions:

**Local storage for Spur data**

IPoid fetches a large (~700MB) gzipped JSON file from Spur, then processes the data and stores it in its MariaDB database. The file decompresses to ~4GB, with a row count on the order of millions.

**Regular data updates**

TODO: IPoid should update its database daily.

This service is based on the service-runner template (https://github.com/wikimedia/service-template-node).

---

**Interaction with MediaWiki**

MediaWiki accesses IPoid's REST API through the [SecurityApi](https://www.mediawiki.org/wiki/Extension:SecurityApi) extension. IPoid receives requests for this data from MediaWiki and serves it through that API.

---


## Running the application

### Baremetal

```
npm start
```

This starts an HTTP server listening on `localhost:6927`. There are several
routes you may query (with a browser, or `curl` and friends):

* `http://localhost:6927/_info/`
* `http://localhost:6927/_info/name`
* `http://localhost:6927/_info/version`
* `http://localhost:6927/_info/home`
* `http://localhost:6927/{domain}/v1/siteinfo{/prop}`
* `http://localhost:6927/{domain}/v1/page/{title}`
* `http://localhost:6927/{domain}/v1/page/{title}/lead`
* `http://localhost:6927/ex/err/array`
* `http://localhost:6927/ex/err/file`
* `http://localhost:6927/ex/err/manual/error`
* `http://localhost:6927/ex/err/manual/deny`
* `http://localhost:6927/ex/err/auth`

### With Docker

```
docker-compose up --detach
```

Two containers should start:
```
Creating network "ipoid_default" with the default driver
Creating ipoid_db_1 ... done
Creating ipoid_web_1 ... done
```

#### Configs

Edits to configs for local use should be made in `docker-compose.override.yml`. Most configs are good out of the box but you'll need to update the following:

|Config|Value|
|---|---|
|`SPUR_API_KEY`|API key|

## Getting Started

### Getting Data

**Data from Spur (manual)**

Feed files can be retrieved from Spur for local use with an API key (keep this safe!). These files should ideally be placed in the `$DATADIR` folder. Here are some examples:

List the available files:
```
curl -H "Token: $TOKEN" "https://feeds.spur.us/v2/$FEED_TYPE/"
```

Get the latest feed file and save it to latest.json.gz:
```
curl -o latest.json.gz -L -H "Token: $TOKEN" "https://feeds.spur.us/v2/$FEED_TYPE/latest.json.gz"
```

Get the feed file for a specific date and save it as gzipped JSON:
```
curl -L -H "Token: $TOKEN" -o 20230719.feed.json.gz "https://feeds.spur.us/v2/$FEED_TYPE/20230719/feed.json.gz"
```
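
The URL patterns in the curl examples above can also be built programmatically. The following is a minimal sketch, not part of IPoid: `feedUrl` and `feedHeaders` are hypothetical helper names, and the only facts taken from this README are the base URL, the two path shapes, and the `Token` header.

```javascript
// Base URL taken from the curl examples in this README.
const FEED_BASE = 'https://feeds.spur.us/v2';

// Hypothetical helper: build the URL of a feed file.
// `feedType` plays the role of $FEED_TYPE; `date` is an optional yyyymmdd string.
function feedUrl(feedType, date) {
  if (date === undefined) {
    // No date given: point at the latest feed file.
    return `${FEED_BASE}/${feedType}/latest.json.gz`;
  }
  if (!/^\d{8}$/.test(date)) {
    throw new Error(`Expected a yyyymmdd date, got: ${date}`);
  }
  return `${FEED_BASE}/${feedType}/${date}/feed.json.gz`;
}

// Hypothetical helper: request headers equivalent to `-H "Token: $TOKEN"`.
function feedHeaders(token) {
  return { Token: token };
}
```

For example, `feedUrl(process.env.FEED_TYPE, '20230719')` yields the same URL as the last curl command above.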

**Data from Spur (script)**

IPoid uses `get-feed.js` to programmatically retrieve updates, and this script can also be called manually. To use it, an API key needs to be set up in the `docker-compose.override.yml` file. See `docker-compose.override.yml.example` for the syntax.

`get-feed.js` takes 1 argument, the date in `yyyymmdd` format, and uses it to download that day's feed to `$DATADIR/$DATE.json.gz`.

Run it with:
```
node -e "require('./get-feed.js').init(date);"
```
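
In the command above, `date` is a placeholder for a `yyyymmdd` value. As a rough sketch, such a value could be produced from a JavaScript `Date` as shown below. `toYyyymmdd` is a hypothetical helper, and both the use of UTC and passing the date as a string are assumptions, not something this README specifies.

```javascript
// Hypothetical helper: format a Date as a yyyymmdd string.
// Assumption: the feed date is interpreted in UTC.
function toYyyymmdd(date) {
  const year = String(date.getUTCFullYear()).padStart(4, '0');
  const month = String(date.getUTCMonth() + 1).padStart(2, '0'); // getUTCMonth() is 0-based
  const day = String(date.getUTCDate()).padStart(2, '0');
  return `${year}${month}${day}`;
}

// Possible usage from the IPoid checkout (commented out because it needs the
// repository and a configured API key; a string argument is an assumption):
// require('./get-feed.js').init(toYyyymmdd(new Date()));
```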

|Parameter|Input|Optional|
|---|---|---|
|(1st arg)|Date as `yyyymmdd`|No|

Note that in production, `get-feed` will require a proxy to make requests to spur.us. To use a proxy, set the `HTTPS_PROXY` environment variable, for example `HTTPS_PROXY: http://url-downloader.eqiad.wikimedia.org:8080`.

### Running Scripts

IPoid uses a series of discrete scripts that should be chained together into a full pipeline (feed => database). The full pipeline can be run with `./main.sh`, which can be used to do an initial import of data or update existing data. It tries to be flexible in the parameters it accepts but rigid in how it expects to use them.

Doing an initial import (See 'Initialization scripts' for database setup scripts):

```
./main.sh --init true
# OR
./main.sh --init true --today {YYYYMMDD}
```

Doing an update:
```
./main.sh --today {YYYYMMDD} --yesterday {YYYYMMDD}
# OR
./main.sh --today {YYYYMMDD}
# OR
./main.sh --yesterday {YYYYMMDD}
# OR
./main.sh
```

For initial imports, if `--today` isn't passed, `main.sh` uses the server's date as `today`. For updates, if `main.sh` isn't passed both `--today` and `--yesterday`, it uses bash's `date` utility to guess the missing date(s).
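
The date defaulting can be pictured with the sketch below. It is an illustration in JavaScript, not the logic of `main.sh` (which relies on bash's `date`). `shiftDay` and `resolveDates` are hypothetical names, and the exact fallback rules (fill the missing date from the one that was passed, else start from the current date) are an assumption.

```javascript
// Hypothetical helper: shift a 'YYYYMMDD' string by a number of days (UTC).
function shiftDay(yyyymmdd, offsetDays) {
  const match = /^(\d{4})(\d{2})(\d{2})$/.exec(yyyymmdd);
  if (!match) {
    throw new Error(`Expected a YYYYMMDD date, got: ${yyyymmdd}`);
  }
  const [year, month, day] = match.slice(1).map(Number);
  // Date.UTC normalizes out-of-range days, e.g. day 0 is the last day of the previous month.
  const shifted = new Date(Date.UTC(year, month - 1, day + offsetDays));
  return [
    String(shifted.getUTCFullYear()).padStart(4, '0'),
    String(shifted.getUTCMonth() + 1).padStart(2, '0'),
    String(shifted.getUTCDate()).padStart(2, '0'),
  ].join('');
}

// Assumed defaulting rules: derive whichever date is missing from the other one,
// falling back to `now` (standing in for the server's date) when neither is given.
function resolveDates({ today, yesterday } = {}, now = new Date()) {
  const nowString = now.toISOString().slice(0, 10).replace(/-/g, '');
  const resolvedToday = today || (yesterday ? shiftDay(yesterday, 1) : nowString);
  const resolvedYesterday = yesterday || shiftDay(resolvedToday, -1);
  return { today: resolvedToday, yesterday: resolvedYesterday };
}
```

Working on UTC calendar dates keeps month and year boundaries (including leap days) correct without manual arithmetic.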

`--debug true` can be passed to keep all intermediary files generated by the scripts. This is useful for debugging or updating tests. Any other value causes the intermediary files to be deleted.

`--batchsize` takes an integer and defines the number of rows per batch. This defaults to 10000 and non-default values should only be used for development/testing, as there's no way for the database to know that the batch count is a non-default value.

|Parameter|Input|Optional|
|---|---|---|
|`init`|Declare if this is an initial import or not (only accepts `true` as a value)|Yes|
|`yesterday`|Yesterday's date in `YYYYMMDD` format|Yes|
|`today`|Today's date in `YYYYMMDD` format|Yes|
|`debug`|Keep intermediary files (if `true`)|Yes|
|`batchsize`|Number of rows per batch (defaults to 10000)|Yes|

Note that passing `--init true` also results in `init-db.js` being called with `shouldInit` set to `true`, which means that the script will drop and re-create the `ipoid` database.

#### Initialization scripts

**`create-users.js`**

Usage: `node -e "require('./create-users.js')();"`

Production uses a read/write user (`ipoid_rw`) and a read-only user (`ipoid_ro`). The MariaDB configuration in `docker-compose.yml` only takes one user parameter and makes that user the root user. We keep that user because it's useful for debugging in development. To mimic production, this script creates these two users in the development environment and gives them the appropriate permissions. Subsequent db operations from web are expected to use these users.

**`init-db.js`**

Usage: `node -e "require('./init-db.js').init(shouldInit, updateName);"`

Takes 2 arguments, `shouldInit` and `updateName`. This script initializes a database from scratch or applies updates to an existing database. `shouldInit` is a boolean that determines whether the script should drop the existing database and recreate it from scratch. `updateName` is a string that refers to the `name` of an update as declared in `./schema/updates.json`. If it's passed, the script only applies that update. If neither `shouldInit` nor `updateName` is passed, the script attempts to apply all updates in the order they're declared, ignoring any that have already been run as recorded by the `update_log` table. If `shouldInit` is `true`, it takes priority and `updateName` is ignored.
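
The precedence between the two arguments can be summarized with the sketch below. It is a model of the behaviour described above, not the code of `init-db.js`: `planInitDb` is a hypothetical function, `updates` stands in for the entries of `./schema/updates.json`, and `appliedNames` stands in for the names recorded in the `update_log` table. Whether a named update is re-applied when it is already logged is not stated here, so the sketch simply selects it.

```javascript
// Hypothetical model of the argument precedence; it touches no database.
function planInitDb({ shouldInit, updateName, updates, appliedNames }) {
  if (shouldInit === true) {
    // shouldInit wins: drop and recreate, updateName is ignored.
    return { action: 'recreate', updates: [] };
  }
  if (updateName !== undefined) {
    // Only the named update is selected (assumption: regardless of update_log).
    return { action: 'update', updates: updates.filter((u) => u.name === updateName) };
  }
  // Neither argument: every update not yet logged, in declaration order.
  const applied = new Set(appliedNames);
  return { action: 'update', updates: updates.filter((u) => !applied.has(u.name)) };
}
```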

|Parameter|Input|Optional|
|---|---|---|
|`shouldInit`|Boolean declaring whether or not to create a database from scratch|Yes|
|`updateName`|String that refers to the `name` of an update that should be run|Yes|

#### Diffing Scripts

**`diff.sh`**

Usage: `./diff.sh --yesterday $PATH_TO_YESTERDAY_GZIP --today $PATH_TO_TODAY_GZIP`

|Parameter|Input|Optional|
|---|---|---|
|`yesterday`|Path to the gzipped file (preferably in `$DATADIR`) for yesterday's data|Yes|
|`today`|Path to the gzipped file (preferably in `$DATADIR`) for today's data or if no file for yesterday is passed, it'll be treated as the initial dataset|No|
|`debug`|Keep intermediary files (if `true`)|Yes|

Takes 2 arguments, yesterday's feed and today's feed. Using those, it calculates the difference between them and outputs an SQL file (`$DATADIR/statement.sql`) containing every update that has to be made to the db. Under the hood, this runs `output-sql.js`.
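
Conceptually, the diff identifies records that are present on only one of the two days. The sketch below illustrates that idea on small in-memory arrays. It is not how `diff.sh` is implemented, and `diffFeeds` is a hypothetical name; real feeds are far too large to handle this way.

```javascript
// Illustration only: compare two feeds given as arrays of newline-delimited records.
function diffFeeds(yesterdayLines, todayLines) {
  const yesterday = new Set(yesterdayLines);
  const today = new Set(todayLines);
  return {
    // Records that existed yesterday but are gone today.
    removed: yesterdayLines.filter((line) => !today.has(line)),
    // Records that are new today.
    added: todayLines.filter((line) => !yesterday.has(line)),
  };
}
```

Records present in both feeds need no database change, which is why only the two one-sided lists are returned.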

**`output-sql.js`**

Usage: `node -e "require('./output-sql.js')(filePath, mode);"`

|Parameter|Input|Optional|
|---|---|---|
|1st arg|Path to either a JSON (initial data import) or a sorted newline-delimited list of updates|No|
|2nd arg|Mode, either `import` or `diff`. Defaults to `diff` if mode is undefined or unrecognized.|Yes|

This gets run by `diff.sh` to generate the sql file based on the sorted results of `yesterday_today.unique.sorted`, a hardcoded file generated by `diff.sh`, _or_ a JSON representing the initial dataset to import. This script is meant to be used internally but can be run independently and doing so can be useful for debugging.

#### Import scripts

**`get-properties.js`**

Usage: `node -e "require('./get-properties.js')(filePath);"`

|Parameter|Input|Optional|
|---|---|---|
|`filePath`|Path to either a JSON (initial data import) or a sorted newline-delimited list of updates|No|

Outputs a JSON file that describes the properties (behaviors, proxies, and risks) to be imported into the db. The output file serves as the input file for `import-properties.js`.

**`import-properties.js`**

Usage: `node -e "require('./import-properties.js')(filePath, debugEnabled);"`

|Parameter|Input|Optional|
|---|---|---|
|`filePath`|Path to JSON file to be imported into the db|No|
|`debugEnabled`|Keep intermediary files (only accepts `true` as a value)|Yes|

Takes one required argument, a JSON file that describes the properties (behaviors, proxies, and risks) to be imported into the db. This must be run before any imports from the feed, because the properties must exist before they can be associated with the actors.

**`import.sh`**

Usage: `./import.sh $SLEEP_BETWEEN_BATCHES`

|Parameter|Input|Optional|
|---|---|---|
|`debug`|Keep intermediary files (if `true`)|Yes|
|`batchsize`|Number of rows per batch (defaults to 10,000)|Yes|

Takes an optional argument, an integer representing the number of seconds to sleep between batches. It looks for the hardcoded file, `$DATADIR/statements.sql`, splits it into batches, and then runs `update-db.js` on each batch file, sleeping as necessary in between.

The number of lines in each batch is resolved in this order: the `--batchsize` parameter if passed, then the `$BATCH_SIZE` environment variable, then a default value of 10,000.
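
The batch size precedence and the splitting step can be sketched as follows. This is an illustration, not the implementation of `import.sh` (a shell script operating on files); `resolveBatchSize` and `splitIntoBatches` are hypothetical names, and treating non-positive or non-numeric values as "not set" is an assumption.

```javascript
const DEFAULT_BATCH_SIZE = 10000;

// Pick the batch size: --batchsize parameter first, then $BATCH_SIZE, then the default.
function resolveBatchSize(param, env) {
  for (const candidate of [param, env]) {
    const parsed = Number.parseInt(candidate, 10);
    // Assumption: anything that is not a positive integer counts as unset.
    if (Number.isInteger(parsed) && parsed > 0) {
      return parsed;
    }
  }
  return DEFAULT_BATCH_SIZE;
}

// Split an array of SQL statement lines into consecutive batches.
function splitIntoBatches(lines, batchSize) {
  const batches = [];
  for (let i = 0; i < lines.length; i += batchSize) {
    batches.push(lines.slice(i, i + batchSize));
  }
  return batches;
}
```

For example, `splitIntoBatches(lines, resolveBatchSize(undefined, process.env.BATCH_SIZE))` uses the environment variable when no parameter is given.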

**`update-db.js`**

Usage: `node -e "require('./update-db.js')(filePath);"`

|Parameter|Input|Optional|
|---|---|---|
|(1st arg)|Path to sql file to be imported into the db|No|

Takes 1 argument, the path to an SQL file, and imports it into the db. This is meant to be used internally by `import.sh` but can be run independently.

## Tests

To run basic unit tests, run:

```
npm test
```

If you haven't changed anything in the code (and you have a working Internet
connection), you should see all the tests passing.

### Updating Tests

Occasionally new test cases have to be added to the test suite to help guard against regressions. To add a new test case to the fake data, update the data files (`./test/data/20000101_fake.json` and/or `./test/data/20000102_fake.json`) and run `generate-test-files.sh`. This updates the intermediary files used by various tests. Check the diffs of the regenerated files carefully to confirm that the tested behaviour is correct.

## Docker

The `docker-start` and `docker-test` scripts are deprecated, and only remain for backwards compatibility. Instead, developers should configure `.pipeline/blubber.yaml` and install [Blubber](https://github.com/wikimedia/blubber) to generate the desired Dockerfile.

To see the Dockerfile generated by Blubber, ensure the blubber CLI is set up and execute:
```
blubber .pipeline/blubber.yaml {variant}
```
where `{variant}` is one of the variants defined in `blubber.yaml`, such as build, development, or test.

In place of `docker-test`, to run your service's tests, execute:
```
blubber .pipeline/blubber.yaml test | docker build --tag service-test --file - .
```

```
docker run service-test
```

In place of `docker-start`, to run your service, execute:
```
blubber .pipeline/blubber.yaml production | docker build --tag service-node --file - .
```
```
docker run service-node
```

## Troubleshooting

In a lot of cases when there is an issue with node it helps to recreate the
`node_modules` directory:

```
rm -r node_modules
npm install
```