loci package

Submodules

loci.analytics module

loci.analytics.bbox(gdf)[source]

Computes the bounding box of a GeoDataFrame.

Parameters

gdf (GeoDataFrame) – A GeoDataFrame.

Returns

A Polygon representing the bounding box enclosing all geometries in the GeoDataFrame.

loci.analytics.filter_by_kwd(df, kwd_filter, col_kwds='kwds')[source]

Returns a DataFrame with only those rows that contain the specified keyword.

Parameters
  • df (DataFrame) – The initial DataFrame to be filtered.

  • kwd_filter (string) – The keyword to use for filtering.

  • col_kwds (string) – Name of the column containing the keywords (default: kwds).

Returns

A DataFrame with only those rows that contain kwd_filter.
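As a rough illustration of this kind of filtering, the sketch below works on plain Python dictionaries instead of a DataFrame. The name filter_by_kwd_sketch and the sample rows are hypothetical, and exact keyword matching is an assumption; this is not the library's implementation.

```python
def filter_by_kwd_sketch(rows, kwd_filter, col_kwds="kwds"):
    """Keep only the rows whose keyword list contains kwd_filter."""
    # Assumption: a row matches when the keyword appears verbatim in its list.
    return [row for row in rows if kwd_filter in row[col_kwds]]


rows = [
    {"id": 1, "kwds": ["cafe", "wifi"]},
    {"id": 2, "kwds": ["bar"]},
    {"id": 3, "kwds": ["cafe"]},
]
cafes = filter_by_kwd_sketch(rows, "cafe")
```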

loci.analytics.freq_locationsets(location_visits, location_id_col, locations, locationset_id_col, min_sup, min_length)[source]

Computes frequently visited sets of locations based on frequent itemset mining.

Parameters
  • location_visits (DataFrame) – A DataFrame with location ids and locationset ids.

  • location_id_col (string) – The name of the column containing the location ids.

  • locations (GeoDataFrame) – A GeoDataFrame containing the geometries of the locations.

  • locationset_id_col (string) – The name of the column containing the locationset ids.

  • min_sup (float) – The minimum support threshold.

  • min_length (int) – Minimum length of itemsets to be returned.

Returns

A GeoDataFrame with the support, length and geometry of the computed location sets.
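The idea of frequent itemset mining over location visits can be sketched with a brute-force counter in plain Python. This is only an illustration under stated assumptions: the name freq_locationsets_sketch is hypothetical, min_sup is treated as a fraction of all locationsets, and the library may well use a different mining algorithm and also attaches geometries, which are omitted here.

```python
from collections import defaultdict
from itertools import combinations


def freq_locationsets_sketch(visits, min_sup, min_length):
    """Brute-force frequent itemsets over (locationset_id, location_id) pairs."""
    # Group the locations of each locationset into one transaction.
    transactions = defaultdict(set)
    for locationset_id, location_id in visits:
        transactions[locationset_id].add(location_id)

    # Count in how many transactions each candidate itemset occurs.
    counts = defaultdict(int)
    for items in transactions.values():
        for size in range(min_length, len(items) + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1

    # Keep itemsets whose relative support reaches the threshold (assumption).
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_sup}


visits = [
    ("u1", "A"), ("u1", "B"), ("u1", "C"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
    ("u4", "B"),
]
result = freq_locationsets_sketch(visits, min_sup=0.5, min_length=2)
```

The brute-force enumeration is exponential in the size of a locationset, so it is only suitable for small examples like this one.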

loci.analytics.kwds_freq(gdf, col_kwds='kwds', normalized=False)[source]

Computes the frequency of keywords in the provided GeoDataFrame.

Parameters
  • gdf (GeoDataFrame) – A GeoDataFrame with a keywords column.

  • col_kwds (string) – The column containing the list of keywords (default: kwds).

  • normalized (bool) – If True, the returned frequencies are normalized to [0,1] by dividing by the number of rows in gdf (default: False).

Returns

A dictionary containing for each keyword the number of rows it appears in.
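The counting and normalization described above can be sketched in plain Python, using lists of keywords in place of a GeoDataFrame column. The name kwds_freq_sketch and the sample rows are hypothetical; this is a minimal sketch, not the library's code.

```python
from collections import Counter


def kwds_freq_sketch(rows, normalized=False):
    """Count, for each keyword, the number of rows it appears in."""
    counts = Counter()
    for kwds in rows:
        # set() ensures a keyword repeated within one row is counted once.
        counts.update(set(kwds))
    if normalized:
        n = len(rows)
        return {kwd: count / n for kwd, count in counts.items()}
    return dict(counts)


rows = [["cafe", "wifi"], ["cafe"], ["bar", "cafe", "bar"], ["museum"]]
freq = kwds_freq_sketch(rows)
freq_norm = kwds_freq_sketch(rows, normalized=True)
```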

loci.brs module

loci.brs.checkCohesiveness(gdf, p, items, eps)[source]

loci.brs.compute_score(init, params)[source]

Computes the score of a distribution.

Parameters
  • init – A vector containing the values of the type distribution.

  • params – Configuration parameters.

Returns

Computed score and relative entropy.

loci.brs.create_graph(gdf, eps)[source]

Creates the spatial connectivity graph.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • eps – The spatial distance threshold for edge creation.

Returns

A NetworkX graph and an R-tree index over the points.
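To make the role of eps concrete, here is a small sketch that links every pair of points lying within eps of each other. It is an assumption-laden stand-in: it returns a plain adjacency dictionary rather than a NetworkX graph, uses a quadratic scan instead of an R-tree, and the name connectivity_graph_sketch is hypothetical.

```python
from math import dist


def connectivity_graph_sketch(points, eps):
    """Return an adjacency dict linking points at distance <= eps."""
    graph = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Add an undirected edge when the two points are close enough.
            if dist(points[i], points[j]) <= eps:
                graph[i].add(j)
                graph[j].add(i)
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
graph = connectivity_graph_sketch(points, eps=1.5)
```

A spatial index such as an R-tree avoids comparing all pairs, which matters once the input grows beyond a few thousand points.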

loci.brs.expand_region(G, region_core, region_border, nodes_to_expand, params, types)[source]

Expands a given region by adding the given set of nodes.

Parameters
  • G – The spatial connectivity graph over the input points.

  • region_core – The set of core points of the region.

  • region_border – The set of border points of the region.

  • nodes_to_expand – The set of points to be added.

  • params – The configuration parameters.

  • types – The set of distinct point types.

Returns

The expanded region and its score.

loci.brs.get_neighbors(G, region)[source]

loci.brs.get_region_score_graph(G, types, region, params)[source]

loci.brs.get_types(gdf)[source]

Extracts the types of points and assigns a random color to each type.

Parameters

gdf – A GeoDataFrame containing the input points.

Returns

Set of types and corresponding colors.

loci.brs.init_queue(G, seeds, types, params, topk_regions, start_time, updates)[source]

Initializes the priority queue used for exploration.

Parameters
  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • types – The set of distinct point types.

  • params – The configuration parameters.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

A priority queue to drive the expansion process.

loci.brs.neighbor_extension(G, region)[source]

loci.brs.pickSeeds(gdf, seeds_ratio)[source]

Selects seed points to be used by the ExpCircles algorithm.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • seeds_ratio – Percentage of points to be used as seeds.

Returns

Set of seed points.

loci.brs.process_queue(G, queue, topk_regions, params, types, start_time, updates)[source]

Selects and expands the next region in the queue.

Parameters
  • G – The spatial connectivity graph over the input points.

  • queue – A priority queue of candidate regions.

  • topk_regions – A list to hold the top-k results.

  • params – The configuration parameters.

  • types – The set of distinct point types.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The new state after the expansion.

loci.brs.run_exp_circles(gdf, rtree, G, seeds, params, types, topk_regions, start_time, updates)[source]

Executes the ExpCircles algorithm.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • rtree – The R-tree index constructed over the input points.

  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • params – The configuration parameters.

  • types – The set of distinct point types.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The list of top-k regions found within the given time budget.

loci.brs.run_exp_hybrid(G, seeds, params, types, topk_regions, start_time, updates)[source]

Executes the ExpHybrid algorithm.

Parameters
  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • params – The configuration parameters.

  • types – The set of distinct point types.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The list of top-k regions found within the given time budget.

loci.brs.run_mbrs(gdf, G, rtree, types, params, seeds)[source]

Computes the top-k high/low mixture regions.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • G – The spatial connectivity graph over the input points.

  • rtree – The R-tree index constructed over the input points.

  • types – The set of distinct point types.

  • params – The configuration parameters.

  • seeds – The set of seeds to be used.

Returns

The list of top-k regions found within the given time budget.

loci.brs.show_map(gdf, region, colors)[source]

loci.brs.show_map_topk_convex_regions(gdf, colors, topk_regions)[source]

Draws the convex hull around the points per region on the map. Each point is rendered with a color based on its type.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • colors – A list containing the color corresponding to each type.

  • topk_regions – The list of top-k regions to be displayed.

Returns

A map displaying the top-k regions.

loci.brs.update_topk_list(topk_regions, region_core, region_border, rel_se, score, init, params, start_time, updates)[source]

loci.clustering module

loci.clustering.cluster_shapes(pois, shape_type=1, eps_per_cluster=None)[source]

Computes cluster shapes.

Parameters
  • pois (GeoDataFrame) – The clustered POIs.

  • shape_type (integer) – The method to use for computing cluster shapes (allowed values: 1-3).

  • eps_per_cluster (DataFrame) – The value of parameter eps used for each cluster (required by methods 2 and 3).

Returns

A GeoDataFrame containing the cluster shapes.

loci.clustering.compute_clusters(pois, alg='dbscan', min_pts=None, eps=None, sample_size=-1, kwd=None, n_jobs=1)[source]

Computes clusters using the DBSCAN or the HDBSCAN algorithm.

Parameters
  • pois (GeoDataFrame) – A POI GeoDataFrame.

  • alg (string) – The clustering algorithm to use (dbscan or hdbscan; default: dbscan).

  • min_pts (integer) – The minimum number of neighbors for a dense point.

  • eps (float) – The neighborhood radius.

  • sample_size (int) – Sample size (default: -1; use all POIs).

  • kwd (string) – A keyword to filter by (optional).

  • n_jobs (integer) – Number of parallel jobs to run in the algorithm (default: 1).

Returns

A GeoDataFrame containing the clustered POIs and their labels. The value of parameter eps for each cluster is also returned (which varies in the case of HDBSCAN).

loci.index module

loci.index.grid(pois, cell_width=None, cell_height=None, cell_size_ratio=0.01, znorm=False, neighborhood=False)[source]

Constructs a uniform grid from the given POIs.

If cell_width and cell_height are provided, each grid cell has size cell_width * cell_height. Otherwise, cell_width = cell_size_ratio * area_width and cell_height = cell_size_ratio * area_height, where area refers to the bounding box of pois.

Each cell is assigned a score, which is the number of points within that cell.

If neighborhood is True, each cell is assigned an additional score (score_nb), which is the total number of points within that cell and its adjacent cells.

If znorm is True, the above scores are also provided in their z-normalized variants, score_znorm and score_nb_znorm.

The constructed grid is represented by a GeoDataFrame where each row corresponds to a grid cell and contains the following columns:

  • cell_id: The id of the cell (integer computed as: cell_x * num_columns + cell_y)

  • cell_x: The row of the cell in the grid (integer).

  • cell_y: The column of the cell in the grid (integer).

  • score: see above

  • score_nb: see above

  • score_znorm: see above

  • score_nb_znorm: see above

  • contents: List of points in the cell.

  • geometry: Geometry column of the GeoDataFrame that contains the polygon representing the cell boundaries.

Parameters
  • pois (GeoDataFrame) – a POIs GeoDataFrame.

  • cell_width (float) – cell width.

  • cell_height (float) – cell height.

  • cell_size_ratio (float) – ratio of cell width and height to area width and height (default: 0.01).

  • znorm (bool) – Whether to include z-normalized scores (default: False).

  • neighborhood (bool) – Whether to include a total score including adjacent cells (default: False).

Returns

A GeoDataFrame as described above.
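The score and score_nb values can be illustrated with a small pure-Python sketch that bins points into cells and sums each non-empty cell with its neighbours. Several details are assumptions rather than documented behavior: the grid origin is taken at the minimum coordinates, adjacency means the eight surrounding cells, only non-empty cells are reported, and the name grid_scores_sketch is hypothetical.

```python
from collections import Counter


def grid_scores_sketch(points, cell_width, cell_height):
    """Count points per cell and per 3x3 neighbourhood of each non-empty cell."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    # score: number of points falling into each cell, keyed by cell indices.
    score = Counter(
        (int((x - min_x) // cell_width), int((y - min_y) // cell_height))
        for x, y in points
    )
    # score_nb: points in the cell plus its adjacent cells (assumed 8-neighbourhood).
    score_nb = {
        (cx, cy): sum(
            score.get((cx + dx, cy + dy), 0)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
        )
        for (cx, cy) in score
    }
    return dict(score), score_nb


points = [(0.0, 0.0), (0.5, 0.5), (1.5, 0.2), (3.5, 3.5)]
score, score_nb = grid_scores_sketch(points, cell_width=1.0, cell_height=1.0)
```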

loci.io module

loci.io.crop(gdf, min_lon, min_lat, max_lon, max_lat)[source]

Crops the given GeoDataFrame according to the given bounding box.

Parameters
  • gdf (GeoDataFrame) – The original GeoDataFrame.

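  • min_lon (float) – Minimum longitude of the bounding box.

  • min_lat (float) – Minimum latitude of the bounding box.

  • max_lon (float) – Maximum longitude of the bounding box.

  • max_lat (float) – Maximum latitude of the bounding box.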

Returns

The cropped GeoDataFrame.

loci.io.import_osmnx(bound, target_crs='EPSG:4326')[source]

Creates a POI GeoDataFrame from POIs retrieved by OSMNX (https://github.com/gboeing/osmnx).

Parameters
  • bound (polygon) – A polygon to be used as filter.

  • target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).

Returns

A POI GeoDataFrame with columns id, name and kwds.

loci.io.import_osmwrangle(osmwrangle_file, target_crs='EPSG:4326', bound=None)[source]

Creates a POI GeoDataFrame from a file produced by OSMWrangle (https://github.com/SLIPO-EU/OSMWrangle).

Parameters
  • osmwrangle_file (string) – Path or URL to the input csv file.

  • target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).

  • bound (polygon) – A polygon to be used as filter.

Returns

A POI GeoDataFrame with columns id, name and kwds.

loci.io.read_csv(input_file, sep=',', col_id='id', col_name='name', col_lon='lon', col_lat='lat', col_kwds='keywords', kwds_sep=';', source_crs='EPSG:4326', target_crs='EPSG:4326')[source]

Creates a DataFrame from a CSV file and converts it to a GeoDataFrame.

Parameters
  • input_file (string) – Path to the input CSV file.

  • sep (string) – Column delimiter (default: ,).

  • col_id (string) – Name of the column containing the id (default: id).

  • col_name (string) – Name of the column containing the name (default: name).

  • col_lon (string) – Name of the column containing the longitude (default: lon).

  • col_lat (string) – Name of the column containing the latitude (default: lat).

  • col_kwds (string) – Name of the column containing the keywords (default: keywords).

  • kwds_sep (string) – Keywords delimiter (default: ;).

  • source_crs (string) – Coordinate Reference System of input data (default: EPSG:4326).

  • target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).

Returns

A GeoDataFrame.

loci.io.read_poi_csv(input_file, col_id='id', col_name='name', col_lon='lon', col_lat='lat', col_kwds='kwds', col_sep=';', kwds_sep=',', source_crs='EPSG:4326', target_crs='EPSG:4326', keep_other_cols=False)[source]

Creates a POI GeoDataFrame from an input CSV file.

Parameters
  • input_file (string) – Path to the input csv file.

  • col_id (string) – Name of the column containing the POI id (default: id).

  • col_name (string) – Name of the column containing the POI name (default: name).

  • col_lon (string) – Name of the column containing the POI longitude (default: lon).

  • col_lat (string) – Name of the column containing the POI latitude (default: lat).

  • col_kwds (string) – Name of the column containing the POI keywords (default: kwds).

  • col_sep (string) – Column delimiter (default: ;).

  • kwds_sep (string) – Keywords delimiter (default: ,).

  • source_crs (string) – Coordinate Reference System of input data (default: EPSG:4326).

  • target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).

  • keep_other_cols (bool) – Whether to keep the rest of the columns in the csv file (default: False).

Returns

A POI GeoDataFrame with columns id, name and kwds.
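To show how the two delimiters interact, the following sketch parses an in-memory CSV with the standard csv module, using the defaults from the signature above (col_sep=';' and kwds_sep=','). The name read_poi_rows_sketch and the sample data are hypothetical, and the sketch returns plain dictionaries rather than a GeoDataFrame and performs no CRS handling.

```python
import csv
import io


def read_poi_rows_sketch(text, col_sep=";", kwds_sep=",", col_kwds="kwds"):
    """Parse POI rows, splitting the keyword column into a list."""
    reader = csv.DictReader(io.StringIO(text), delimiter=col_sep)
    rows = []
    for row in reader:
        # Split the keyword cell and drop empty entries.
        row[col_kwds] = [k.strip() for k in row[col_kwds].split(kwds_sep) if k.strip()]
        row["lon"] = float(row["lon"])
        row["lat"] = float(row["lat"])
        rows.append(row)
    return rows


sample = (
    "id;name;lon;lat;kwds\n"
    "1;Cafe Luna;23.72;37.98;cafe,wifi\n"
    "2;City Museum;23.73;37.97;museum\n"
)
pois = read_poi_rows_sketch(sample)
```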

loci.io.retrieve_osm_loc(name, buffer_dist=0)[source]

Retrieves a polygon from an OSM location.

Parameters
  • name (string) – Name of the location to be resolved.

  • buffer_dist (numeric) – Buffer distance in meters.

Returns

A polygon.

loci.io.to_geojson(gdf, output_file)[source]

Exports a GeoDataFrame to a GeoJSON file.

Parameters
  • gdf (GeoDataFrame) – The GeoDataFrame object to be exported.

  • output_file (string) – Path to the output file.

loci.mbrs module

loci.mbrs.check_cohesiveness(gdf, p, region, eps)[source]

Checks if point p is within distance eps from at least one of the points in the region.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • p – Location of the point to examine.

  • region – A list with the identifiers of the points currently in the region.

  • eps – The distance threshold.

Returns

A Boolean value.
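The check described above amounts to a nearest-point test. The sketch below uses a dictionary of coordinates in place of the GeoDataFrame and assumes Euclidean distance; the name check_cohesiveness_sketch is hypothetical and the library's actual lookup may differ.

```python
from math import dist


def check_cohesiveness_sketch(coords, p, region, eps):
    """Return True if p lies within eps of at least one point of the region."""
    return any(dist(p, coords[i]) <= eps for i in region)


coords = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (10.0, 10.0)}
near = check_cohesiveness_sketch(coords, (2.5, 0.0), region=[0, 1], eps=1.0)
far = check_cohesiveness_sketch(coords, (5.0, 5.0), region=[0, 1], eps=1.0)
```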

loci.mbrs.compute_score(init, region_size, params)[source]

Computes the score of a distribution.

Parameters
  • init – A vector containing the values of the type distribution.

  • region_size – The number of points that constitute the region.

  • params – Configuration parameters.

Returns

Computed score and relative entropy.

loci.mbrs.create_graph(gdf, eps)[source]

Creates the spatial connectivity graph.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • eps – The spatial distance threshold for edge creation.

Returns

A NetworkX graph and an R-tree index over the points.

loci.mbrs.expand_region(G, region_core, region_border, nodes_to_expand, params, types)[source]

Expands a given region by adding the given set of nodes.

Parameters
  • G – The spatial connectivity graph over the input points.

  • region_core – The set of core points of the region.

  • region_border – The set of border points of the region.

  • nodes_to_expand – The set of points to be added.

  • params – The configuration parameters.

  • types – The set of distinct point types.

Returns

The expanded region and its score.

loci.mbrs.expand_region_with_neighbors(G, region)[source]

Expands a given region with its neighboring nodes according to the graph.

Parameters
  • G – The spatial connectivity graph over the input points.

  • region – The set of points currently in the region.

Returns

The expanded region.
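One way to picture this expansion is as a single step outward on the connectivity graph. The sketch below assumes the graph is given as an adjacency dictionary (not a NetworkX graph) and uses the hypothetical name expand_with_neighbors_sketch.

```python
def expand_with_neighbors_sketch(graph, region):
    """Return the region together with all graph neighbours of its points."""
    expanded = set(region)
    for node in region:
        # Add every point directly connected to a point already in the region.
        expanded |= graph[node]
    return expanded


graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
expanded = expand_with_neighbors_sketch(graph, {0, 1})
```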

loci.mbrs.get_region_score(G, types, region, params)[source]

Computes the score of the given region according to the connectivity graph.

Parameters
  • G – The spatial connectivity graph over the input points.

  • types – The set of distinct point types.

  • region – The set of points in the region.

  • params – The configuration parameters.

Returns

The score of the region, its relative entropy, and a vector with the values of the POI type distribution.

loci.mbrs.get_types(gdf)[source]

Extracts the types of points and assigns a random color to each type.

Parameters

gdf – A GeoDataFrame containing the input points.

Returns

Set of types and corresponding colors.

loci.mbrs.init_queue(G, seeds, types, params, topk_regions, start_time, updates)[source]

Initializes the priority queue used for exploration.

Parameters
  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • types – The set of distinct point types.

  • params – The configuration parameters.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

A priority queue to drive the expansion process.

loci.mbrs.partition_data_in_grid(gdf, cell_size)[source]

Partitions a GeoDataFrame of points into a uniform grid of square cells.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • cell_size – The size of the square cell (same units as the coordinates in the input data).

Returns

An R-tree index over the input points; also, a GeoDataFrame representing the centroids of the non-empty cells of the grid.

loci.mbrs.pick_seeds(gdf, seeds_ratio)[source]

Selects seed points to be used by the ExpCircles algorithm.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • seeds_ratio – Percentage of points to be used as seeds.

Returns

Set of seed points.

loci.mbrs.process_queue(G, queue, topk_regions, params, types, start_time, updates)[source]

Selects and expands the next region in the queue.

Parameters
  • G – The spatial connectivity graph over the input points.

  • queue – A priority queue of candidate regions.

  • topk_regions – A list to hold the top-k results.

  • params – The configuration parameters.

  • types – The set of distinct point types.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The new state after the expansion.

loci.mbrs.run(gdf, G, rtree, types, params, eps)[source]

Computes the top-k high/low mixture regions.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • G – The spatial connectivity graph over the input points.

  • rtree – The R-tree index constructed over the input points.

  • types – The set of distinct point types.

  • params – The configuration parameters.

  • eps – The distance threshold.

Returns

The list of top-k regions detected within the given time budget.

loci.mbrs.run_exp_circles(gdf, rtree, G, seeds, params, eps, types, topk_regions, start_time, updates)[source]

Executes the ExpCircles algorithm. Employs a priority queue of seeds and expands the search in circles of increasing radii around each seed.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • rtree – The R-tree index constructed over the input points.

  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • params – The configuration parameters.

  • eps – The distance threshold.

  • types – The set of distinct point types.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The list of top-k regions found within the given time budget.

loci.mbrs.run_exp_hybrid(G, seeds, params, types, topk_regions, start_time, updates)[source]

Executes the ExpHybrid algorithm.

Parameters
  • G – The spatial connectivity graph over the input points.

  • seeds – The set of seeds to be used.

  • params – The configuration parameters.

  • types – The set of distinct point types.

  • topk_regions – A list to hold the top-k results.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The list of top-k regions found within the given time budget.

loci.mbrs.show_map(gdf, region, colors)[source]

Draws the points belonging to a single region on the map. Each point is rendered with a color based on its type.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • region – The region to be displayed, i.e., a list of the identifiers of its constituent points.

  • colors – A list containing the color corresponding to each type.

Returns

A map displaying the given region.

loci.mbrs.show_map_topk_convex_regions(gdf, colors, topk_regions)[source]

Draws the convex hull around the points per region on the map. Each point is rendered with a color based on its type.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • colors – A list containing the color corresponding to each type.

  • topk_regions – The list of top-k regions to be displayed.

Returns

A map displaying the top-k regions.

loci.mbrs.show_map_topk_grid_regions(gdf, prtree, colors, gdf_grid, cell_size, topk_regions)[source]

Draws the points per grid-based region on the map. Each point is rendered with a color based on its type.

Parameters
  • gdf – A GeoDataFrame containing the input points.

  • prtree – The R-tree index already constructed over the input points.

  • colors – A list containing the color corresponding to each type.

  • gdf_grid – The grid partitioning (cell centroids with their POI types) created over the input points.

  • cell_size – The size of the square cell in the applied grid partitioning (user-specified distance threshold eps).

  • topk_regions – The list of top-k grid-based regions to be displayed.

Returns

A map displaying the top-k regions along with the grid cells constituting each region.

loci.mbrs.update_topk_list(topk_regions, region_core, region_border, rel_se, score, init, params, start_time, updates)[source]

Checks and updates the list of top-k regions with a candidate region.

Parameters
  • topk_regions – The current list of top-k best regions.

  • region_core – The set of core points of the candidate region.

  • region_border – The set of border points of the candidate region.

  • rel_se – The relative entropy of the candidate region.

  • score – The score of the candidate region.

  • init – A vector containing the values of the type distribution of the points in the candidate region.

  • params – The configuration parameters.

  • start_time – The starting time of the execution.

  • updates – A structure to hold update times of new results.

Returns

The updated list of the top-k best regions.
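The general pattern of maintaining a bounded list of best results can be sketched as follows. This is a simplified illustration only: it ranks candidates purely by score and ignores the other arguments in the signature (such as rel_se, init, params and updates), whose exact handling is not documented here. The name update_topk_sketch and the parameter k are hypothetical.

```python
def update_topk_sketch(topk, candidate, score, k):
    """Insert a scored candidate and keep only the k best entries."""
    topk.append((score, candidate))
    # Highest score first; sorting by score only avoids comparing the regions.
    topk.sort(key=lambda entry: entry[0], reverse=True)
    del topk[k:]
    return topk


topk = []
for region, score in [({1, 2}, 0.4), ({3, 4, 5}, 0.9), ({6}, 0.6)]:
    update_topk_sketch(topk, region, score, k=2)
```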

loci.plots module

loci.plots.barchart(data, orientation='Vertical', x_axis_label='', y_axis_label='', plot_title='', bar_width=0.5, plot_width=15, plot_height=5, top_k=10)[source]

Plots a bar chart with the given data.

Parameters
  • data (dict) – The data to plot.

  • orientation (string) – The orientation of the bars in the plot (Vertical or Horizontal; default: Vertical).

  • x_axis_label (string) – Label of x axis.

  • y_axis_label (string) – Label of y axis.

  • plot_title (string) – Title of the plot.

  • bar_width (scalar) – The width of the bars (default: 0.5).

  • plot_width (scalar) – The width of the plot (default: 15).

  • plot_height (scalar) – The height of the plot (default: 5).

  • top_k (integer) – Top k results (if -1, show all; default: 10).

Returns

A Matplotlib plot displaying the bar chart.

loci.plots.heatmap(pois, sample_size=-1, kwd=None, tiles='OpenStreetMap', width='100%', height='100%', radius=10)[source]

Generates a heatmap of the input POIs.

Parameters
  • pois (GeoDataFrame) – A POIs GeoDataFrame.

  • sample_size (int) – Sample size (default: -1; show all).

  • kwd (string) – A keyword to filter by (optional).

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

  • radius (float) – Radius of each point of the heatmap (default: 10).

Returns

A Folium Map object displaying the heatmap generated from the POIs.

loci.plots.map_choropleth(areas, id_field, value_field, fill_color='YlOrRd', fill_opacity=0.6, num_bins=5, tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map showing the given areas as a choropleth. Map center and zoom level are set automatically.

Parameters
  • areas (GeoDataFrame) – A GeoDataFrame containing the areas to be displayed.

  • id_field (string) – The name of the column to use as id.

  • value_field (string) – The name of the column indicating the area’s value.

  • fill_color (string) – A string indicating a Matplotlib colormap (default: YlOrRd).

  • fill_opacity (float) – Opacity level (default: 0.6).

  • num_bins (int) – The number of bins for the threshold scale (default: 5).

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the given areas.

loci.plots.map_cluster_contents_osm(cluster_borders, tiles='OpenStreetMap', width='100%', height='100%')[source]

Constructs a Folium Map displaying the streets and buildings, retrieved from OpenStreetMap via OSMNX, within a given area of interest (AOI).

Parameters
  • cluster_borders (GeoDataFrame) – The cluster polygons.

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the retrieved entities.

loci.plots.map_cluster_diff(clusters_a, clusters_b, intersection_color='#00ff00', diff_ab_color='#0000ff', diff_ba_color='#ff0000', tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map displaying the differences between two sets of clusters. Map center and zoom level are set automatically.

Parameters
  • clusters_a (GeoDataFrame) – The first set of clusters.

  • clusters_b (GeoDataFrame) – The second set of clusters.

  • intersection_color (color code) – The color to use for A & B.

  • diff_ab_color (color code) – The color to use for A - B.

  • diff_ba_color (color code) – The color to use for B - A.

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying cluster intersections and differences.

loci.plots.map_clusters_with_topics(clusters_topics, viz_type='dominant', col_id='cluster_id', col_dominant='Dominant Topic', colormap='tab10', red='Topic0', green='Topic1', blue='Topic2', single_topic='Topic0', tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map showing the clusters colored based on their topics.

Parameters
  • clusters_topics (GeoDataFrame) – A GeoDataFrame containing the clusters to be displayed and their topics.

  • viz_type (string) – Indicates how to assign colors based on topics. One of: ‘dominant’, ‘single’, ‘rgb’.

  • col_id (string) – The name of the column indicating the cluster id (default: cluster_id).

  • col_dominant (string) – The name of the column indicating the dominant topic (default: Dominant Topic).

  • colormap (string) – A string indicating a Matplotlib colormap (default: tab10).

  • red (string) – The name of the column indicating the topic to assign to red (default: Topic0).

  • green (string) – The name of the column indicating the topic to assign to green (default: Topic1).

  • blue (string) – The name of the column indicating the topic to assign to blue (default: Topic2).

  • single_topic (string) – The name of the column indicating the topic to use (default: Topic0).

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the given clusters colored by their topics.

loci.plots.map_geometries(gdf, tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map displaying the provided geometries. Map center and zoom level are set automatically.

Parameters
  • gdf (GeoDataFrame) – A GeoDataFrame containing the geometries to be displayed.

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the given geometries.

loci.plots.map_geometry(geom, tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map displaying the provided geometry. Map center and zoom level are set automatically.

Parameters
  • geom (Shapely Geometry) – A geometry to be displayed.

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the given geometry.

loci.plots.map_points(pois, sample_size=-1, kwd=None, show_bbox=False, tiles='OpenStreetMap', width='100%', height='100%')[source]

Returns a Folium Map displaying the provided points. Map center and zoom level are set automatically.

Parameters
  • pois (GeoDataFrame) – A GeoDataFrame containing the POIs to be displayed.

  • sample_size (int) – Sample size (default: -1; show all).

  • kwd (string) – A keyword to filter by (optional).

  • show_bbox (bool) – Whether to show the bounding box of the GeoDataFrame (default: False).

  • tiles (string) – The tiles to use for the map (default: OpenStreetMap).

  • width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).

  • height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).

Returns

A Folium Map object displaying the given POIs.

loci.plots.plot_wordcloud(pois, bg_color='black', width=400, height=200)[source]

Generates and plots a word cloud from the keywords of the given POIs.

Parameters
  • pois (GeoDataFrame) – The POIs from which the keywords will be used to generate the word cloud.

  • bg_color (string) – The background color to use for the plot (default: black).

  • width (int) – The width of the plot.

  • height (int) – The height of the plot.

loci.set_evolution module

class loci.set_evolution.Change_Detector[source]

Bases: object

Change Detector class for studying evolving sets.

Parameters
  • similarities (dict) – map for caching group similarities, in the form of (group1, group2) -> sim

  • groups (dict) – 3d map, in the form of snapshot->group->member

  • matchings (dict) – map for caching snapshot similarities, in the form of (snapshot1, snapshot2) -> (group1, group2, sim)

  • inv_index (dict) – 3d map, in the form of member->snapshot->group
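The two nested maps above mirror each other. As a minimal sketch (not the library's code), the following shows how a member->snapshot->group index could be derived from a snapshot->group->member map. The toy data, the function name build_inverted_index, and the assumption that a member belongs to at most one group per snapshot are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: snapshot -> group -> set of members.
groups = {
    "s1": {"g1": {"a", "b", "c"}, "g2": {"d", "e"}},
    "s2": {"g1": {"a", "b"}, "g2": {"c", "d", "e"}},
}


def build_inverted_index(groups):
    """Invert snapshot -> group -> members into member -> snapshot -> group.

    Assumes each member appears in at most one group per snapshot.
    """
    inv_index = defaultdict(dict)
    for snap, snap_groups in groups.items():
        for group, members in snap_groups.items():
            for member in members:
                inv_index[member][snap] = group
    return dict(inv_index)


inv_index = build_inverted_index(groups)
```

With such an index, looking up the groups of a member is a single dictionary access instead of a scan over all snapshots.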

get_group_evolution(snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]

Calculate Group Evolution for a specific group and return how its members have been distributed across the groups of another snapshot. The status of the group is also labeled with a number (0 -> Similar, 1 -> Split, 2 -> Diffused), which is determined by how many groups contain the fraction tau of its members and whether that count lies between alpha and beta.

Parameters
  • snap1 (str) – ID of 1st snapshot.

  • group1 (str) – ID of 1st group.

  • snap2 (str) – ID of 2nd snapshot.

  • tau (float, optional) – Percentage collected, defaults to 0.8

  • alpha (int, optional) – Lower Bound for decision, defaults to 1

  • beta (int, optional) – Upper Bound for decision, defaults to 3

Raises

ValueError – ID not in IDs.

Returns

(status, related_groups): Status, i.e. 0 -> Similar, 1 -> Split, 2 -> Diffused. Groups, percentage in each group.

Return type

tuple
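The labeling rule is only described informally above, so the following is a hedged sketch of one possible reading, not the library's implementation. It assumes that the target groups are taken in order of decreasing overlap until they cover the fraction tau of the original members, and that the number of groups needed decides the status: at most alpha gives 0 (Similar), at most beta gives 1 (Split), anything else gives 2 (Diffused). The function name and the plain-dict input are hypothetical.

```python
def group_evolution_sketch(members, target_groups, tau=0.8, alpha=1, beta=3):
    """Classify how `members` are spread over `target_groups`.

    `target_groups` maps a group id to its members in the other snapshot.
    Returns (status, related_groups) under the assumptions stated above.
    """
    members = set(members)
    if not members:
        raise ValueError("group has no members")

    # Number of original members found in each target group.
    overlaps = {gid: len(members & set(other)) for gid, other in target_groups.items()}
    overlaps = {gid: n for gid, n in overlaps.items() if n > 0}

    # Count how many of the largest overlaps are needed to cover tau.
    threshold = tau * len(members)
    covered, needed = 0, 0
    for n in sorted(overlaps.values(), reverse=True):
        if covered >= threshold:
            break
        covered += n
        needed += 1

    if covered < threshold or needed > beta:
        status = 2  # Diffused (assumed rule)
    elif needed > alpha:
        status = 1  # Split (assumed rule)
    else:
        status = 0  # Similar (assumed rule)

    related_groups = {gid: n / len(members) for gid, n in overlaps.items()}
    return status, related_groups


original = set(range(10))
similar = group_evolution_sketch(original, {"A": set(range(9)), "B": {9}})
split = group_evolution_sketch(original, {"A": set(range(5)), "B": set(range(5, 10))})
diffused = group_evolution_sketch(
    original, {k: {2 * k, 2 * k + 1} for k in range(5)}
)
```

The same counting idea applies to provenance, with the roles of the two snapshots swapped.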

get_group_provenance(snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]

Calculate Group Provenance for a specific group and return how its members originated from the groups of another snapshot. The status of the group is also labeled with a number (0 -> Similar, 1 -> Merged, 2 -> Accumulated), which is determined by how many groups contain the fraction tau of its members and whether that count lies between alpha and beta.

Parameters
  • snap1 (str) – ID of 1st snapshot.

  • group1 (str) – ID of 1st group.

  • snap2 (str) – ID of 2nd snapshot.

  • tau (float, optional) – Percentage collected, defaults to 0.8

  • alpha (int, optional) – Lower Bound for decision, defaults to 1

  • beta (int, optional) – Upper Bound for decision, defaults to 3

Raises

ValueError – ID not in IDs.

Returns

(status, related_groups): Status, i.e. 0 -> Similar, 1 -> Merged, 2 -> Accumulated. Groups, percentage in each group.

Return type

tuple

get_group_similarity(snap1, group1, snap2, group2)[source]

Calculate Similarity between 2 groups (Jaccard).

Parameters
  • snap1 (str) – ID of 1st snapshot.

  • group1 (str) – ID of 1st group.

  • snap2 (str) – ID of 2nd snapshot.

  • group2 (str) – ID of 2nd group.

Raises

ValueError – ID not in list of IDs.

Returns

Similarity of the two groups (jaccard).

Return type

int
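For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below is a generic illustration on plain Python sets; the function name and the choice to return 0.0 for two empty groups are assumptions, not taken from the library.

```python
def jaccard(group_a, group_b):
    """Jaccard similarity |A & B| / |A | B| of two member collections."""
    a, b = set(group_a), set(group_b)
    if not a and not b:
        return 0.0  # assumption: two empty groups are treated as dissimilar
    return len(a & b) / len(a | b)


score = jaccard({"a", "b", "c"}, {"b", "c", "d"})
```

Here the two groups share two of four distinct members, so the score is 0.5.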

get_groups(snap)[source]

Return the ids of groups inside a specific snapshot.

Parameters

snap (str) – ID of snapshot.

Raises

ValueError – ID not in snapshot IDs.

Returns

List of ids of groups.

Return type

list

get_groups_of_member(member)[source]

Return the ids of (snapshot,group) that a member belongs to.

Parameters

member (str) – ID of member.

Raises

ValueError – Member ID not in IDs.

Returns

List of ids of (snapshot,group).

Return type

list

get_member_comembers(member, snap)[source]

Return the ids of members in the same group as member in the specific snapshot.

Parameters
  • member (str) – ID of member.

  • snap (str) – ID of snapshot.

Raises

ValueError – ID not in IDs.

Returns

List of ids of members.

Return type

list

get_member_evolution(member)[source]

Calculate Member Evolution and return the similarity scores between the groups that the member belongs to in subsequent snapshots.

Parameters

member (str) – ID of member.

Raises

ValueError – Member ID not in member IDs.

Returns

Evolution of member.

Return type

list

get_member_rules(member, min_support=0.07, min_threshold=1, metric='lift')[source]

Return rules from frequent subgroups mining for specific member.

Parameters
  • member (list, optional) – Member ID to use for group filtering.

  • min_support (float, optional) – A float between 0 and 1 for the minimum support of the itemsets returned, defaults to 0.07

  • min_threshold (float, optional) – Minimal threshold for the evaluation metric, via the metric parameter, to decide whether a candidate rule is of interest, defaults to 1

  • metric (str, optional) – Metric to use: ‘lift’, ‘support’, ‘confidence’, ‘leverage’, ‘conviction’, defaults to ‘lift’

Returns

pandas DataFrame with columns “antecedents” and “consequents” that store itemsets, plus the scoring metric columns “antecedent support”, “consequent support”, “support”, “confidence”, “lift”, “leverage”, “conviction”, for all rules for which metric(rule) >= min_threshold. Each entry in the “antecedents” and “consequents” columns is of type frozenset, a Python built-in type that behaves like a set except that it is immutable (for more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).

Return type

DataFrame
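To make the metric columns concrete, the sketch below computes support, confidence, lift and leverage for a single rule using their standard definitions. It is an illustration only: the function name rule_metrics and the idea of treating each group as one transaction are assumptions, and the library's own mining step is not reproduced here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Standard association-rule metrics for the rule antecedent -> consequent."""
    transactions = [frozenset(t) for t in transactions]
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    sup_a = support(antecedent)
    sup_c = support(consequent)
    sup_rule = support(antecedent | consequent)
    if sup_a == 0 or sup_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    confidence = sup_rule / sup_a
    return {
        "antecedent support": sup_a,
        "consequent support": sup_c,
        "support": sup_rule,
        "confidence": confidence,
        "lift": confidence / sup_c,
        "leverage": sup_rule - sup_a * sup_c,
    }


# Hypothetical transactions: each set lists the members of one group.
toy_groups = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
metrics = rule_metrics(toy_groups, {"a"}, {"b"})
```

A lift below 1, as in this toy data, means the two members co-occur less often than independence would suggest, so the rule would be dropped with the default min_threshold of 1 on lift.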

get_members(snap, group)[source]

Return the ids of members inside a specific group of a specific snapshot.

Parameters
  • snap (str) – ID of snapshot.

  • group (str) – ID of group.

Raises

ValueError – ID not in IDs.

Returns

List of ids of members.

Return type

list

get_snapshot_evolution()[source]

Calculate Snapshot Evolution and return the similarity scores between subsequent snapshots.

Returns

Evolution of snapshot sequence.

Return type

list

get_snapshot_similarity(snap1, snap2, groups=False)[source]

Calculate Snapshot Similarity between 2 snapshots and return the matching groups with their corresponding similarity scores.

Parameters
  • snap1 (str) – ID of 1st snapshot

  • snap2 (str) – ID of 2nd snapshot

  • groups (bool, optional) – Whether to return the groups of the matching, defaults to False

Raises

ValueError – If snapshot ID not in snapshot IDs.

Returns

If groups=False, returns similarity of two snapshots. If groups=True, returns (sim, matching), i.e. detailed similarities of matching.

Return type

tuple

get_snapshots()[source]

Return the ids of snapshots.

Returns

List of ids of snapshots.

Return type

list

set_data(data, type, file=False)[source]

Pass data to initialize Change Detector. Only csv or json are allowed.

Parameters
  • data (str) – data given or filename to find data.

  • type (str) – type of data given, “json” or “csv”.

  • file (bool, optional) – whether file is given for input. Filename is stored in data, defaults to False

Returns

None

Return type

None

class loci.set_evolution.Graph[source]

Bases: object

Class for visualizing Change Detector methods.

group_evolution(cd, snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]

Create graph content for Group Evolution for given arguments.

Parameters
  • cd (Change_Detector) – Change_Detector object.

  • snap1 (str) – ID of Snapshot1.

  • group1 (str) – ID of Group1.

  • snap2 (str) – ID of Snapshot2.

  • tau (float) – Percentage collected, defaults to 0.8.

  • alpha (int) – Lower Bound for decision, defaults to 1.

  • beta (int) – Upper Bound for decision, defaults to 3.

Returns

None

Return type

None

group_provenance(cd, snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]

Create graph content for Group Provenance for given arguments.

Parameters
  • cd (Change_Detector) – Change_Detector object.

  • snap1 (str) – ID of Snapshot1.

  • group1 (str) – ID of Group1.

  • snap2 (str) – ID of Snapshot2.

  • tau (float) – Percentage collected, defaults to 0.8.

  • alpha (int) – Lower Bound for decision, defaults to 1.

  • beta (int) – Upper Bound for decision, defaults to 3.

Returns

None

Return type

None

group_similarity(cd, snap1, group1, snap2, group2)[source]

Create graph content for Group Similarity for given arguments.

Parameters
  • cd (Change_Detector) – Change_Detector object.

  • snap1 (str) – ID of Snapshot1.

  • group1 (str) – ID of Group1.

  • snap2 (str) – ID of Snapshot2.

  • group2 (str) – ID of Group2.

Returns

None

Return type

None

init_graph(cd)[source]

Create graph content.

Parameters

cd (Change_Detector) – Change_Detector object.

Returns

None

Return type

None

member_evolution(cd, member)[source]

Create graph content for Member Evolution for given member.

Parameters
  • cd (Change_Detector) – Change_Detector object.

  • member (str) – ID of Member.

Returns

None

Return type

None

snapshot_evolution(cd)[source]

Create graph content for Snapshot Evolution.

Parameters

cd (Change_Detector) – Change_Detector object.

Returns

None

Return type

None

snapshot_similarity(cd, snap1, snap2)[source]

Create graph content for Snapshot Similarity between given snapshots.

Parameters
  • cd (Change_Detector) – Change_Detector object.

  • snap1 (str) – ID of Snapshot1

  • snap2 (str) – ID of Snapshot2

Returns

None

Return type

None

loci.time_series module

exception loci.time_series.DictionarySizeIsNotSupported[source]

Bases: Exception

exception loci.time_series.OverlapSpecifiedIsNotSmallerThanWindowSize[source]

Bases: Exception

exception loci.time_series.StringsAreDifferentLength[source]

Bases: Exception

loci.time_series.change_detection_collective(ts_df_array, filenames, model, min_size, eps, min_samples, date_column, data_column)[source]

This method identifies the change points within a collection of time series, ranks them, and distinguishes between global and local changes. To find the change points, the implementation uses the PELT approach (from the ruptures library) and calculates a rate change metric for each of them. Subsequently, using DBSCAN, it forms clusters of change points that are part of the same global change.

Parameters
  • ts_df_array (list) – A list of Pandas dataframes containing the loaded time series.

  • filenames (list) – The corresponding filename of each time series.

  • model (string) – The desired PELT model (one of “l1”, “l2”, “normal”, “rbf”, “linear”, or “ar”).

  • min_size (int) – Minimum number of samples between two change points (ruptures).

  • eps (int) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster (DBSCAN).

  • min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (DBSCAN).

  • date_column (int) – The column number containing the datetime of each entry in each file.

  • data_column (int) – The column number containing the values in each file.

Returns

  • final_data (dict) – A dictionary containing the timestamps/dates of the identified change points, the name or id of the corresponding time series, the rate change of the change points, and the local-global cluster label (-1 stands for local changes).

  • data_clusters (dict) – A dictionary containing all the identified clusters (global changes), their aggregate and absolute aggregate rate changes and corresponding cluster scores, their starting and ending date or timestamp, and the number of members of each cluster.
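The local/global distinction can be illustrated with a much simpler stand-in for DBSCAN. The sketch below is not DBSCAN and not the library's code: it chains sorted change points whose gap is at most eps, keeps chains with at least min_samples points as global clusters, and labels everything else -1 (local). Timestamps are represented as plain numbers (for example day indices), and the function name is hypothetical.

```python
def group_change_points(times, eps, min_samples):
    """Label each change point with a cluster id, or -1 for a local change.

    Simplified 1-D gap-based grouping used only to illustrate the idea.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    labels = [-1] * len(times)

    # Split the sorted points into runs wherever the gap exceeds eps.
    runs, current = [], []
    for i in order:
        if current and times[i] - times[current[-1]] > eps:
            runs.append(current)
            current = []
        current.append(i)
    if current:
        runs.append(current)

    # Runs that are large enough become global clusters.
    next_label = 0
    for run in runs:
        if len(run) >= min_samples:
            for i in run:
                labels[i] = next_label
            next_label += 1
    return labels


labels = group_change_points([1, 2, 3, 20, 40, 41], eps=2, min_samples=2)
```

In this toy input the change point at time 20 has no neighbor within eps, so it keeps the label -1, matching the convention described for final_data.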

loci.time_series.change_setection_single(ts_df, model, min_size)[source]

This method identifies the change points within a single time series using the PELT approach, as implemented in the ruptures library (https://centre-borelli.github.io/ruptures-docs/user-guide/detection/pelt/).

Parameters
  • ts_df (pandas.DataFrame) – A Pandas dataframe containing the loaded time series

  • model (string) – The desired PELT model (one of “l1”, “l2”, “normal”, “rbf”, “linear”, or “ar”).

  • min_size (int) – The minimum distance (in number of timestamps) between two consecutive change points.

loci.time_series.create_sankey(df_array, alphabet_size, word_size, begin, end, from_value, to_value)[source]

This method generates and returns an interactive, Plotly-based SankeyTS diagram. The bands and flows of the diagram are generated based on the SAX words of the loaded set of time series.

Parameters
  • df_array (list) – A list containing a Pandas dataframe for each read time series.

  • alphabet_size (int) – The alphabet size for SAX encoding.

  • word_size (int) – The SAX word length.

  • begin (datetime) – The starting date for the SankeyTS diagram.

  • end (datetime) – The ending date for the SankeyTS diagram.

  • from_value (float) – The starting value for the SankeyTS diagram.

  • to_value (float) – The ending value for the SankeyTS diagram.

Returns

A SankeyTS diagram.

Return type

plotly.graph_objects.Figure
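The alphabet_size and word_size parameters refer to SAX (Symbolic Aggregate approXimation) encoding. The sketch below shows the textbook procedure on a plain list of numbers: z-normalize, average over word_size equal segments (PAA), then map each average to a letter using equiprobable breakpoints of the standard normal distribution. It is a generic illustration, not the library's encoder; the function name, the supported alphabet range, and the requirement that the series length be divisible by word_size are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_size, alphabet_size):
    """Encode a numeric series as a SAX word of length `word_size`."""
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26 in this sketch")
    if len(series) % word_size != 0:
        raise ValueError("series length must be divisible by word_size in this sketch")

    # 1. z-normalize the series (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]

    # 2. Piecewise Aggregate Approximation: mean of each equal-length segment.
    seg = len(z) // word_size
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]

    # 3. Breakpoints that split the standard normal into equiprobable regions.
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[bisect_right(breakpoints, value)] for value in paa)


word = sax_word([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], word_size=3, alphabet_size=3)
```

The low, middle and high segments of this toy series map to "a", "b" and "c" respectively, so the word is "abc".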

loci.time_series.read_file(my_file)[source]

Reads a single time series from the given file.

Parameters

my_file (string) – The file to read.

Returns

A Pandas dataframe containing the read time series.

Return type

pandas.DataFrame

loci.time_series.read_files(my_path, date_column, data_column, date_format='%m/%d/%Y')[source]

Reads all the co-evolving time series files contained in the given path into Pandas dataframes.

Parameters
  • my_path (string) – The path containing the time series files.

  • date_column (int) – The column number containing the datetime of each entry in each file.

  • data_column (int) – The column number containing the values in each file.

  • date_format (string) – The format of the date.

Returns

  • df_array (list) – A list containing a Pandas dataframe for each read time series.

  • filenames (list) – A list containing all read filenames.

  • start_date (datetime) – The starting datetime of all time series.

  • end_date (datetime) – The ending datetime of all time series.
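The interplay of date_column, data_column and date_format can be sketched with the standard library alone. The example below is illustrative and does not reproduce the library's reader: it assumes comma-separated rows without a header and 0-based column numbers, neither of which is confirmed by the text above, and the function name parse_series is hypothetical.

```python
import csv
import io
from datetime import datetime


def parse_series(text, date_column, data_column, date_format="%m/%d/%Y"):
    """Parse CSV text into a time-ordered list of (datetime, value) pairs.

    Assumes no header row and 0-based column numbers.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        when = datetime.strptime(record[date_column], date_format)
        rows.append((when, float(record[data_column])))
    rows.sort()
    return rows


sample = "01/02/2020,3.5\n01/01/2020,1.0\n"
series = parse_series(sample, date_column=0, data_column=1)
start_date, end_date = series[0][0], series[-1][0]
```

With the default format, "01/02/2020" is read as January 2, 2020 (month first), which is worth checking against the files being loaded.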

loci.time_series.seasonal_decomposition(ts_df, date_column, data_column, periods, m)[source]

This method performs the Triple Time Series Decomposition. The service takes as input a time series, the corresponding model type (“Multiplicative” or “Additive”), a list of period parameters and the corresponding locale, if applicable. The user can provide the path of the selected data or pass the data in array form. If more than one period parameter is provided, the method selects the best one according to the gain index. The provided time series is decomposed into three distinct components according to the selected model and period:

  • Trend: the increasing or decreasing value in the series.

  • Seasonality: the repeating short term cycle in the series.

  • Residual Error: the random variation in the series.

An additive model suggests that the components are added together as follows: y(t) = Trend + Seasonality + Residual Error

A multiplicative model suggests that the components are multiplied together as follows: y(t) = Trend * Seasonality * Residual Error

This implementation uses the “statsmodels.tsa.seasonal.seasonal_decompose” from the statsmodels library.

Parameters
  • ts_df (pandas.DataFrame) – A Pandas dataframe containing the loaded time series.

  • date_column (int) – The column number containing the datetime of each entry in each file.

  • data_column (int) – The column number containing the values in each file.

  • periods (list) – A list containing the periods to be tested.

  • m (string) – The type of the model, ‘additive’ or ‘multiplicative’.

Returns

  • result (statsmodels.tsa.seasonal.seasonal_decompose) – The result of the seasonal decomposition.

  • best_period (json) – The selected best period based on the minimum mean absolute value of the residual error component.

  • gain_indexes (json) – A json with the {period:gain_index} for all the tested periods.

  • fig1 (plotly.express) – A Plotly figure containing the trend, seasonality and residual error components.

  • fig2 (plotly.express) – A Plotly figure containing the seasonality component.
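To make the additive model and the period selection concrete, the sketch below implements a classical moving-average decomposition in plain Python and picks the period with the smallest mean absolute residual. It is an approximation for illustration, not the statsmodels routine or the library's gain index; the function names and the requirement of at least two full periods of data are assumptions.

```python
from statistics import mean


def additive_decompose(values, period):
    """Classical additive decomposition: values = trend + seasonal + residual."""
    n = len(values)
    if period < 2 or n < 2 * period:
        raise ValueError("need period >= 2 and at least two full periods of data")
    half = period // 2

    # Trend: centered moving average (end points weighted 0.5 for even periods).
    trend = [None] * n
    for t in range(half, n - half):
        window = values[t - half:t + half + 1]
        if period % 2:
            trend[t] = mean(window)
        else:
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period

    # Seasonality: mean detrended value per phase, shifted to sum to zero.
    phase_means = []
    for phase in range(period):
        detrended = [values[t] - trend[t] for t in range(phase, n, period)
                     if trend[t] is not None]
        phase_means.append(mean(detrended))
    offset = mean(phase_means)
    seasonal = [phase_means[t % period] - offset for t in range(n)]

    # Residual: what is left where the trend is defined.
    residual = [None if trend[t] is None else values[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual


def best_period(values, periods):
    """Pick the period with the smallest mean absolute residual."""
    scores = {}
    for period in periods:
        _, _, residual = additive_decompose(values, period)
        scores[period] = mean([abs(r) for r in residual if r is not None])
    return min(scores, key=scores.get), scores


# Toy series: linear trend plus a zero-sum pattern that repeats every 3 steps.
pattern = [1, -1, 0]
toy = [t + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = additive_decompose(toy, 3)
chosen, scores = best_period(toy, [3, 4])
```

For a multiplicative model, the same steps apply with division in place of subtraction, assuming strictly positive values.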

loci.topics module

loci.topics.topic_modeling(clusters, label_col='cluster_id', kwds_col='kwds', num_of_topics=3, kwds_per_topic=10)[source]

Models clusters as documents, extracts topics, and assigns topics to clusters.

Parameters
  • clusters (GeoDataFrame) – A POI GeoDataFrame with assigned cluster labels.

  • label_col (string) – The name of the column containing the cluster labels (default: cluster_id).

  • kwds_col (string) – The name of the column containing the keywords of each POI (default: kwds).

  • num_of_topics (int) – The number of topics to extract (default: 3).

  • kwds_per_topic (int) – The number of keywords to return per topic (default: 10).

Returns

A DataFrame containing the clusters-to-topics assignments and a DataFrame containing the topics-to-keywords assignments.

Module contents