# Architecture

The architecture of the Black Swan application with all its components is illustrated below:

### Analysis

In order to discover potentially unusual statistical developments that may correlate with certain events, *Black Swan* applies different regression algorithms to the time series. These algorithms include:

- **Linear regression:** This determines a linear approximation of the statistic and calculates the residual (the absolute vertical distance from the curve) for every point. An outlier is defined as a point whose residual is above a given threshold.
- **Non-polynomial regression:** This finds the best fitting curve for the given statistic and calculates the residual for every point. An outlier is defined again as a point whose residual is above a given threshold.
- **Slope analysis, extremum analysis, ML-based analysis, …**
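
The residual-threshold idea behind the linear regression variant can be sketched in a few lines. This is only a minimal illustration, not the *Black Swan* implementation: the function names, the sample series, and the threshold value are assumptions made for the example, and the residual is kept signed so that the direction of the deviation stays visible.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_outliers(
    xs: Sequence[float], ys: Sequence[float], threshold: float
) -> List[Tuple[float, float, float]]:
    """Return (x, y, signed residual) for points whose |residual| exceeds threshold."""
    slope, intercept = fit_line(xs, ys)
    outliers = []
    for x, y in zip(xs, ys):
        residual = y - (slope * x + intercept)
        if abs(residual) > threshold:
            outliers.append((x, y, residual))
    return outliers


# Invented sample series (not real data): a steady trend with one spike.
years = [2000, 2001, 2002, 2003, 2004, 2005, 2006]
values = [10.0, 12.0, 14.0, 30.0, 18.0, 20.0, 22.0]
found = linear_outliers(years, values, threshold=8.0)
print(found)
```

With this sample series only the spike in 2003 is reported. Lowering the threshold would flag more points, which is why the choice of threshold matters as much as the choice of algorithm.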

A user can either use a default set of analysis algorithms or select one or more of the implemented algorithms themselves. Depending on the algorithms employed, different outliers may be detected, which provides great flexibility in the search for Black Swans.

### Extraction

Events causing outliers in a time series can be of various types. Therefore, event data from different categories is collected from the web and integrated into one database. This involves a number of integration steps:

- **Mapping to a unique schema:** Because the events come from different sources, they have different attributes, which are mapped to a global schema. This requires classifying the events, which is also a prerequisite for association rule mining.
- **Associating a unique location:** Events are associated with a geolocation. However, the granularity of these geolocations may differ depending on the event: cities, countries or whole continents might be affected by it. Furthermore, an event in a city may also have an impact on the country the city is located in. Therefore, it is important to link an event to every location where it happened.
- **Removing duplicates:** The integration of various sources can produce duplicates in the database. *Black Swan* detects and merges them to present only relevant information to the user.

These steps aggregate and integrate many different historical events from the last two centuries and make it possible to inspect their influence on the development of different socioeconomic factors.
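
The three integration steps can be illustrated with a small sketch. Everything named here is hypothetical: the `Event` fields, the per-source field maps, the `PARENT` location hierarchy, and the duplicate key (normalized title plus year) are assumptions chosen for the example, since the text does not specify the actual schema or matching strategy.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical global schema; the real attribute set is not given."""
    title: str
    year: int
    category: str
    locations: FrozenSet[str]


# Assumed containment hierarchy: each place points to the place containing it.
PARENT: Dict[str, str] = {"Berlin": "Germany", "Germany": "Europe"}


def expand_locations(place: str) -> FrozenSet[str]:
    """Link a place to itself and to every enclosing place."""
    found = set()
    current: Optional[str] = place
    while current is not None:
        found.add(current)
        current = PARENT.get(current)
    return frozenset(found)


def to_global_schema(raw: Dict[str, str], field_map: Dict[str, str], category: str) -> Event:
    """Rename source-specific attributes to the global ones and classify the event."""
    renamed = {field_map[key]: value for key, value in raw.items() if key in field_map}
    return Event(
        title=renamed["title"].strip(),
        year=int(renamed["year"]),
        category=category,
        locations=expand_locations(renamed["location"]),
    )


def merge_duplicates(events: Iterable[Event]) -> List[Event]:
    """Merge events sharing a normalized title and year, uniting their locations."""
    merged: Dict[Tuple[str, int], Event] = {}
    for event in events:
        key = (event.title.lower(), event.year)
        if key in merged:
            kept = merged[key]
            merged[key] = Event(kept.title, kept.year, kept.category,
                                kept.locations | event.locations)
        else:
            merged[key] = event
    return list(merged.values())


# Invented records from two imaginary sources describing the same event.
source_a = {"name": "Example flood", "yr": "1990", "place": "Berlin"}
source_b = {"event": "Example Flood ", "date": "1990", "where": "Germany"}

events = merge_duplicates([
    to_global_schema(source_a, {"name": "title", "yr": "year", "place": "location"}, "natural disaster"),
    to_global_schema(source_b, {"event": "title", "date": "year", "where": "location"}, "natural disaster"),
])
print(events)
```

An exact match on title and year is the simplest possible duplicate test. A real integration pipeline would likely need fuzzier matching, for instance on stemmed titles or overlapping date ranges.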

### Rule Mining

Not every (type of) event has an impact on a certain statistic. On the other hand, there can be previously unidentified correlations between events and statistics, i.e., true Black Swan events. The following aspects are included to automatically determine rules between statistics and events:

- **Categories:** Statistics and events are categorized in the extraction process. This meta-information is used to find more general dependencies.
- **Outlier tendency:** Some events have a positive influence on a time series while others have a negative impact. To derive this influence, the mining process takes into account the tendency of an outlier (i.e., whether its value is higher or lower than expected).
- **Outlier history:** Some events tend to occur after a specific development in the statistic. These correlations are found by evaluating the recent history preceding an outlier in a time series.

Irrelevant rules are eliminated by pruning. Afterwards, one can discover interesting connections between statistics and events that were previously obvious only to domain experts, or to no one at all.
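
As a rough illustration of mining and pruning, the sketch below counts how often an event category co-occurs with an outlier of a given tendency in a given statistic category, and keeps only rules above a minimum support and confidence. Support and confidence are the standard association rule measures, but their use here, the thresholds, and the observation format are assumptions. The text does not state which measures or pruning criteria *Black Swan* applies.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (event category, statistic category, outlier tendency)
Observation = Tuple[str, str, str]


def mine_rules(
    observations: List[Observation], min_support: float, min_confidence: float
) -> Dict[Observation, Tuple[float, float]]:
    """Return rules "event category -> (statistic category, tendency)" that
    survive pruning, mapped to their (support, confidence)."""
    total = len(observations)
    rule_counts = Counter(observations)
    antecedent_counts = Counter(event_cat for event_cat, _, _ in observations)
    rules = {}
    for rule, count in rule_counts.items():
        support = count / total                          # share of all observations
        confidence = count / antecedent_counts[rule[0]]  # share within the event category
        if support >= min_support and confidence >= min_confidence:
            rules[rule] = (support, confidence)
    return rules


# Invented event/outlier pairs, for illustration only.
observations = [
    ("war", "economy", "lower"),
    ("war", "economy", "lower"),
    ("war", "economy", "lower"),
    ("war", "health", "higher"),
    ("election", "economy", "higher"),
    ("election", "economy", "lower"),
]
rules = mine_rules(observations, min_support=0.3, min_confidence=0.6)
print(rules)
```

On this sample, only the rule linking the "war" category to lower-than-expected economic values survives. The remaining candidates each occur once and fall below the support threshold.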

### Visualization

*Black Swan* is developed as a web application using the Google Web Toolkit. In addition to presenting all events related to an outlier, it lists the rules that indicate the relevance of these events. The visualization provides two methods for exploring the data:

- **Based on a statistic:** Select a statistic (and country) and find the events that match its outliers as well as the rules that produced these pairs.
- **Based on a rule:** Select a rule and find the matching pairs of statistical outliers and historical events.

With these navigation methods it is easy to access the desired information and to explore the available data to discover new insights.
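
Both navigation methods amount to looking up the same set of outlier/event matches under different keys. The sketch below shows one way to index them. The `Match` tuple layout, the rule names, and the sample rows are hypothetical, and nothing here reflects how the web application actually stores its data.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# (statistic, country, outlier year, event title, rule name)
Match = Tuple[str, str, int, str, str]


def build_indexes(matches: Iterable[Match]):
    """Index the same matches once by (statistic, country) and once by rule."""
    by_statistic: DefaultDict[Tuple[str, str], List[Tuple[int, str, str]]] = defaultdict(list)
    by_rule: DefaultDict[str, List[Tuple[str, str, int, str]]] = defaultdict(list)
    for statistic, country, year, event, rule in matches:
        by_statistic[(statistic, country)].append((year, event, rule))
        by_rule[rule].append((statistic, country, year, event))
    return by_statistic, by_rule


# Invented matches, for illustration only.
matches = [
    ("GDP", "Country A", 1990, "Example flood", "disaster -> economy lower"),
    ("GDP", "Country A", 1995, "Example strike", "labor -> economy lower"),
    ("Life expectancy", "Country B", 1990, "Example epidemic", "disaster -> health lower"),
]
by_statistic, by_rule = build_indexes(matches)

# Navigation based on a statistic: events and rules for its outliers.
print(by_statistic[("GDP", "Country A")])
# Navigation based on a rule: matching outlier/event pairs.
print(by_rule["disaster -> economy lower"])
```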

### Libraries

The following libraries are used for the *Black Swan* main application:

- DB2JCC (DB2 JDBC driver)
- JRI (Java/R interface)
- Lucene Snowball (Stemmer)
- WEKA (Machine Learning)

In addition, the *Black Swan* web application utilizes the following libraries:

The code of the *Black Swan* prototype is open source and available on request.