loci package
Submodules
loci.analytics module
- loci.analytics.bbox(gdf)[source]
Computes the bounding box of a GeoDataFrame.
- Parameters
gdf (GeoDataFrame) – A GeoDataFrame.
- Returns
A Polygon representing the bounding box enclosing all geometries in the GeoDataFrame.
- loci.analytics.filter_by_kwd(df, kwd_filter, col_kwds='kwds')[source]
Returns a DataFrame with only those rows that contain the specified keyword.
- Parameters
df (DataFrame) – The initial DataFrame to be filtered.
kwd_filter (string) – The keyword to use for filtering.
col_kwds (string) – Name of the column containing the keywords (default: kwds).
- Returns
A DataFrame with only those rows that contain kwd_filter.
- loci.analytics.freq_locationsets(location_visits, location_id_col, locations, locationset_id_col, min_sup, min_length)[source]
Computes frequently visited sets of locations based on frequent itemset mining.
- Parameters
location_visits (DataFrame) – A DataFrame with location ids and locationset ids.
location_id_col (string) – The name of the column containing the location ids.
locations (GeoDataFrame) – A GeoDataFrame containing the geometries of the locations.
locationset_id_col (string) – The name of the column containing the locationset ids.
min_sup (float) – The minimum support threshold.
min_length (int) – Minimum length of itemsets to be returned.
- Returns
A GeoDataFrame with the support, length and geometry of the computed location sets.
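To make the roles of min_sup and min_length concrete, here is a minimal brute-force sketch of frequent itemset mining over location sets. It is plain Python and not the loci implementation; the function name frequent_locationsets and the dictionary input format are assumptions made for this illustration only.

```python
from itertools import combinations

def frequent_locationsets(visits, min_sup, min_length):
    """visits maps a locationset id to the set of location ids it contains.

    Returns {tuple_of_location_ids: support} for every combination of at
    least min_length locations whose support is >= min_sup.
    """
    n = len(visits)
    locations = sorted(set().union(*visits.values()))
    results = {}
    for size in range(min_length, len(locations) + 1):
        found = False
        for combo in combinations(locations, size):
            # Support = fraction of location sets that contain every item.
            support = sum(1 for s in visits.values() if set(combo) <= s) / n
            if support >= min_sup:
                results[combo] = support
                found = True
        if not found:
            # No frequent set of this size, so no larger one can be frequent.
            break
    return results

# Hypothetical input: four location sets over locations A, B and C.
visits = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "C"},
    "u4": {"B", "C"},
}
frequent = frequent_locationsets(visits, min_sup=0.5, min_length=2)
```

Each pair appears in two of the four sets (support 0.5) and is kept, while the triple appears only once (support 0.25) and is dropped. Enumerating all combinations is only practical for small inputs.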
- loci.analytics.kwds_freq(gdf, col_kwds='kwds', normalized=False)[source]
Computes the frequency of keywords in the provided GeoDataFrame.
- Parameters
gdf (GeoDataFrame) – A GeoDataFrame with a keywords column.
col_kwds (string) – The column containing the list of keywords (default: kwds).
normalized (bool) – If True, the returned frequencies are normalized in [0,1] by dividing by the number of rows in gdf (default: False).
- Returns
A dictionary containing for each keyword the number of rows it appears in.
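The counting described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the loci code: keyword_frequencies is a hypothetical helper and the rows are plain lists of keywords rather than a GeoDataFrame column.

```python
from collections import Counter

def keyword_frequencies(rows, normalized=False):
    """Count, for each keyword, the number of rows that contain it."""
    counts = Counter()
    for kwds in rows:
        counts.update(set(kwds))  # set(): a keyword counts once per row
    if normalized:
        # Divide by the number of rows so every value falls in [0, 1].
        return {k: c / len(rows) for k, c in counts.items()}
    return dict(counts)

# Hypothetical keyword lists, one per row.
rows = [["cafe", "bar"], ["cafe"], ["museum", "cafe"], ["bar"]]
freq = keyword_frequencies(rows)
freq_norm = keyword_frequencies(rows, normalized=True)
```

With normalized=True the value for "cafe" is 3/4, since it appears in three of the four rows.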
loci.brs module
- loci.brs.compute_score(init, params)[source]
Computes the score of a distribution.
- Parameters
init – A vector containing the values of the type distribution.
params – Configuration parameters.
- Returns
Computed score and relative entropy.
- loci.brs.create_graph(gdf, eps)[source]
Creates the spatial connectivity graph.
- Parameters
gdf – A GeoDataFrame containing the input points.
eps – The spatial distance threshold for edge creation.
- Returns
A NetworkX graph and an R-tree index over the points.
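As a rough picture of what a spatial connectivity graph is, the sketch below links every pair of points that lie within eps of each other. It is not the loci implementation: it uses a plain adjacency dictionary instead of NetworkX, assumes planar Euclidean distance, and compares all pairs directly, which is the quadratic cost an R-tree index avoids. The name connectivity_graph is hypothetical.

```python
from math import dist

def connectivity_graph(points, eps):
    """points maps a point id to (x, y). Returns {id: set of neighbor ids}."""
    graph = {pid: set() for pid in points}
    ids = list(points)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Add an undirected edge when the two points are within eps.
            if dist(points[a], points[b]) <= eps:
                graph[a].add(b)
                graph[b].add(a)
    return graph

# Hypothetical coordinates: points 0 and 1 are close, point 2 is isolated.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (5.0, 5.0)}
graph = connectivity_graph(points, eps=1.5)
```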
- loci.brs.expand_region(G, region_core, region_border, nodes_to_expand, params, types)[source]
Expands a given region by adding the given set of nodes.
- Parameters
G – The spatial connectivity graph over the input points.
region_core – The set of core points of the region.
region_border – The set of border points of the region.
nodes_to_expand – The set of points to be added.
params – The configuration parameters.
types – The set of distinct point types.
- Returns
The expanded region and its score.
- loci.brs.get_types(gdf)[source]
Extracts the types of points and assigns a random color to each type.
- Parameters
gdf – A GeoDataFrame containing the input points.
- Returns
Set of types and corresponding colors.
- loci.brs.init_queue(G, seeds, types, params, topk_regions, start_time, updates)[source]
Initializes the priority queue used for exploration.
- Parameters
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
types – The set of distinct point types.
params – The configuration parameters.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
A priority queue to drive the expansion process.
- loci.brs.pickSeeds(gdf, seeds_ratio)[source]
Selects seed points to be used by the ExpCircles algorithm.
- Parameters
gdf – A GeoDataFrame containing the input points.
seeds_ratio – Percentage of points to be used as seeds.
- Returns
Set of seed points.
- loci.brs.process_queue(G, queue, topk_regions, params, types, start_time, updates)[source]
Selects and expands the next region in the queue.
- Parameters
G – The spatial connectivity graph over the input points.
queue – A priority queue of candidate regions.
topk_regions – A list to hold the top-k results.
params – The configuration parameters.
types – The set of distinct point types.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The new state after the expansion.
- loci.brs.run_exp_circles(gdf, rtree, G, seeds, params, types, topk_regions, start_time, updates)[source]
Executes the ExpCircles algorithm.
- Parameters
gdf – A GeoDataFrame containing the input points.
rtree – The R-tree index constructed over the input points.
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
params – The configuration parameters.
types – The set of distinct point types.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The list of top-k regions found within the given time budget.
- loci.brs.run_exp_hybrid(G, seeds, params, types, topk_regions, start_time, updates)[source]
Executes the ExpHybrid algorithm.
- Parameters
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
params – The configuration parameters.
types – The set of distinct point types.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The list of top-k regions found within the given time budget.
- loci.brs.run_mbrs(gdf, G, rtree, types, params, seeds)[source]
Computes the top-k high/low mixture regions.
- Parameters
gdf – A GeoDataFrame containing the input points.
G – The spatial connectivity graph over the input points.
rtree – The R-tree index constructed over the input points.
types – The set of distinct point types.
params – The configuration parameters.
seeds – The set of seeds to be used.
- Returns
The list of top-k regions found within the given time budget.
- loci.brs.show_map_topk_convex_regions(gdf, colors, topk_regions)[source]
Draws the convex hull around the points per region on the map. Each point is rendered with a color based on its type.
- Parameters
gdf – A GeoDataFrame containing the input points.
colors – A list containing the color corresponding to each type.
topk_regions – The list of top-k regions to be displayed.
- Returns
A map displaying the top-k regions.
loci.clustering module
- loci.clustering.cluster_shapes(pois, shape_type=1, eps_per_cluster=None)[source]
Computes cluster shapes.
- Parameters
pois (GeoDataFrame) – The clustered POIs.
shape_type (integer) – The method to use for computing cluster shapes (allowed values: 1-3).
eps_per_cluster (DataFrame) – The value of parameter eps used for each cluster (required by methods 2 and 3).
- Returns
A GeoDataFrame containing the cluster shapes.
- loci.clustering.compute_clusters(pois, alg='dbscan', min_pts=None, eps=None, sample_size=-1, kwd=None, n_jobs=1)[source]
Computes clusters using the DBSCAN or the HDBSCAN algorithm.
- Parameters
pois (GeoDataFrame) – A POI GeoDataFrame.
alg (string) – The clustering algorithm to use (dbscan or hdbscan; default: dbscan).
min_pts (integer) – The minimum number of neighbors for a dense point.
eps (float) – The neighborhood radius.
sample_size (int) – Sample size (default: -1; show all).
kwd (string) – A keyword to filter by (optional).
n_jobs (integer) – Number of parallel jobs to run in the algorithm (default: 1).
- Returns
A GeoDataFrame containing the clustered POIs and their labels. The value of parameter eps for each cluster is also returned (which varies in the case of HDBSCAN).
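To show how eps and min_pts interact, here is a compact textbook DBSCAN in plain Python. It is only a sketch: loci's own clustering may rely on a library and behave differently in details. In this version a point's neighborhood includes the point itself, distances are planar Euclidean, and noise is labeled -1; all three are assumptions of the example.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point: keep expanding
    return labels

# Hypothetical coordinates: two tight groups and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

The two groups of three points become clusters 0 and 1, and the isolated point is labeled as noise.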
loci.index module
- loci.index.grid(pois, cell_width=None, cell_height=None, cell_size_ratio=0.01, znorm=False, neighborhood=False)[source]
Constructs a uniform grid from the given POIs.
If cell_width and cell_height are provided, each grid cell has size cell_width * cell_height. Otherwise, cell_width = cell_size_ratio * area_width and cell_height = cell_size_ratio * area_height, where area refers to the bounding box of pois.
Each cell is assigned a score, which is the number of points within that cell.
If neighborhood is True, each cell is assigned an additional score (score_nb), which is the total number of points within that cell and its adjacent cells.
If znorm is True, the above scores are also provided in their z-normalized variants, score_znorm and score_nb_znorm.
The constructed grid is represented by a GeoDataFrame where each row corresponds to a grid cell and contains the following columns:
cell_id: The id of the cell (integer computed as: cell_x * num_columns + cell_y)
cell_x: The row of the cell in the grid (integer).
cell_y: The column of the cell in the grid (integer).
score: see above
score_nb: see above
score_znorm: see above
score_nb_znorm: see above
contents: List of points in the cell.
geometry: Geometry column of the GeoDataFrame that contains the polygon representing the cell boundaries.
- Parameters
pois (GeoDataFrame) – a POIs GeoDataFrame.
cell_width (float) – cell width.
cell_height (float) – cell height.
cell_size_ratio (float) – ratio of cell width and height to area width and height (default: 0.01).
znorm (bool) – Whether to include z-normalized scores (default: False).
neighborhood (bool) – Whether to include a total score including adjacent cells (default: False).
- Returns
A GeoDataFrame as described above.
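The scores described for the grid can be sketched with plain tuples. This is an illustration under stated assumptions rather than the loci code: the helper names are hypothetical, only non-empty cells are tracked, "adjacent" is taken to mean the eight surrounding cells, and z-normalization uses the population standard deviation over the tracked cells.

```python
from collections import Counter
from statistics import mean, pstdev

def grid_scores(points, cell_width, cell_height):
    """Return (score, score_nb) dictionaries keyed by (cell_x, cell_y)."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    # score: number of points that fall inside each cell.
    score = Counter(
        (int((x - min_x) // cell_width), int((y - min_y) // cell_height))
        for x, y in points
    )
    # score_nb: points in the cell plus its (up to) eight adjacent cells.
    score_nb = {
        (cx, cy): sum(score.get((cx + dx, cy + dy), 0)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        for (cx, cy) in score
    }
    return dict(score), score_nb

def znorm(scores):
    """Z-normalize a {cell: score} dictionary."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {cell: (s - mu) / sigma if sigma else 0.0 for cell, s in scores.items()}

# Hypothetical coordinates on a grid of 1 x 1 cells.
pts = [(0.0, 0.0), (0.5, 0.5), (1.5, 0.5), (3.5, 3.5)]
score, score_nb = grid_scores(pts, cell_width=1.0, cell_height=1.0)
score_znorm = znorm(score)
```

Cells (0, 0) and (1, 0) are adjacent, so each picks up the other's points in score_nb, while the distant cell (3, 3) keeps its own count.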
loci.io module
- loci.io.crop(gdf, min_lon, min_lat, max_lon, max_lat)[source]
Crops the given GeoDataFrame according to the given bounding box.
- Parameters
gdf (GeoDataFrame) – The original GeoDataFrame.
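min_lon (float) – Minimum longitude of the bounding box.
min_lat (float) – Minimum latitude of the bounding box.
max_lon (float) – Maximum longitude of the bounding box.
max_lat (float) – Maximum latitude of the bounding box.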
- Returns
The cropped GeoDataFrame.
- loci.io.import_osmnx(bound, target_crs='EPSG:4326')[source]
Creates a POI GeoDataFrame from POIs retrieved by OSMNX (https://github.com/gboeing/osmnx).
- Parameters
bound (polygon) – A polygon to be used as filter.
target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).
- Returns
A POI GeoDataFrame with columns id, name and kwds.
- loci.io.import_osmwrangle(osmwrangle_file, target_crs='EPSG:4326', bound=None)[source]
Creates a POI GeoDataFrame from a file produced by OSMWrangle (https://github.com/SLIPO-EU/OSMWrangle).
- Parameters
osmwrangle_file (string) – Path or URL to the input csv file.
target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).
bound (polygon) – A polygon to be used as filter.
- Returns
A POI GeoDataFrame with columns id, name and kwds.
- loci.io.read_csv(input_file, sep=',', col_id='id', col_name='name', col_lon='lon', col_lat='lat', col_kwds='keywords', kwds_sep=';', source_crs='EPSG:4326', target_crs='EPSG:4326')[source]
Creates a DataFrame from a CSV file and then converts it to a GeoDataFrame.
- Parameters
input_file (string) – Path to the input CSV file.
sep (string) – Column delimiter (default: ,).
col_id (string) – Name of the column containing the id (default: id).
col_name (string) – Name of the column containing the name (default: name).
col_lon (string) – Name of the column containing the longitude (default: lon).
col_lat (string) – Name of the column containing the latitude (default: lat).
col_kwds (string) – Name of the column containing the keywords (default: keywords).
kwds_sep (string) – Keywords delimiter (default: ;).
source_crs (string) – Coordinate Reference System of input data (default: EPSG:4326).
target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).
- Returns
A GeoDataFrame.
- loci.io.read_poi_csv(input_file, col_id='id', col_name='name', col_lon='lon', col_lat='lat', col_kwds='kwds', col_sep=';', kwds_sep=',', source_crs='EPSG:4326', target_crs='EPSG:4326', keep_other_cols=False)[source]
Creates a POI GeoDataFrame from an input CSV file.
- Parameters
input_file (string) – Path to the input csv file.
col_id (string) – Name of the column containing the POI id (default: id).
col_name (string) – Name of the column containing the POI name (default: name).
col_lon (string) – Name of the column containing the POI longitude (default: lon).
col_lat (string) – Name of the column containing the POI latitude (default: lat).
col_kwds (string) – Name of the column containing the POI keywords (default: kwds).
col_sep (string) – Column delimiter (default: ;).
kwds_sep (string) – Keywords delimiter (default: ,).
source_crs (string) – Coordinate Reference System of input data (default: EPSG:4326).
target_crs (string) – Coordinate Reference System of the GeoDataFrame to be created (default: EPSG:4326).
keep_other_cols (bool) – Whether to keep the rest of the columns in the csv file (default: False).
- Returns
A POI GeoDataFrame with columns id, name and kwds.
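The two delimiters are easy to confuse, so the sketch below shows the file layout implied by the defaults: columns separated by col_sep (;) and keywords inside the keywords column separated by kwds_sep (,). It uses only the standard library and returns plain dictionaries; parse_poi_csv is a hypothetical helper, not part of loci, and the sample rows are made up.

```python
import csv
import io

def parse_poi_csv(text, col_sep=";", kwds_sep=",",
                  col_lon="lon", col_lat="lat", col_kwds="kwds"):
    """Parse CSV text into a list of dicts with numeric lon/lat and keyword lists."""
    reader = csv.DictReader(io.StringIO(text), delimiter=col_sep)
    pois = []
    for row in reader:
        # Split the keywords cell into a clean list of keywords.
        row[col_kwds] = [k.strip() for k in row[col_kwds].split(kwds_sep) if k.strip()]
        row[col_lon] = float(row[col_lon])
        row[col_lat] = float(row[col_lat])
        pois.append(row)
    return pois

# Made-up sample data in the default layout.
sample = (
    "id;name;lon;lat;kwds\n"
    "1;Museum A;23.70;37.90;museum,culture\n"
    "2;Cafe B;23.75;37.95;cafe\n"
)
pois = parse_poi_csv(sample)
```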
loci.mbrs module
- loci.mbrs.check_cohesiveness(gdf, p, region, eps)[source]
Checks if point p is within distance eps from at least one of the points in the region.
- Parameters
gdf – A GeoDataFrame containing the input points.
p – Location of the point to examine.
region – A list with the identifiers of the points currently in the region.
eps – The distance threshold.
- Returns
A Boolean value.
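The cohesiveness test amounts to a single distance check. A minimal sketch, assuming planar Euclidean distance and a plain dictionary of coordinates in place of the GeoDataFrame (is_cohesive is a hypothetical name):

```python
from math import dist

def is_cohesive(coords, p, region, eps):
    """True if p lies within eps of at least one point of the region.

    coords maps point ids to (x, y); region is a list of point ids.
    """
    return any(dist(p, coords[i]) <= eps for i in region)

# Hypothetical coordinates.
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (4.0, 0.0)}
near = is_cohesive(coords, (1.5, 0.0), [0, 1], eps=1.0)
far = is_cohesive(coords, (3.0, 0.0), [0, 1], eps=1.0)
```

Here near is True because the candidate is 0.5 away from point 1, and far is False because the closest region point is 2.0 away.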
- loci.mbrs.compute_score(init, region_size, params)[source]
Computes the score of a distribution.
- Parameters
init – A vector containing the values of the type distribution.
region_size – The number of points that constitute the region.
params – Configuration parameters.
- Returns
Computed score and relative entropy.
- loci.mbrs.create_graph(gdf, eps)[source]
Creates the spatial connectivity graph.
- Parameters
gdf – A GeoDataFrame containing the input points.
eps – The spatial distance threshold for edge creation.
- Returns
A NetworkX graph and an R-tree index over the points.
- loci.mbrs.expand_region(G, region_core, region_border, nodes_to_expand, params, types)[source]
Expands a given region by adding the given set of nodes.
- Parameters
G – The spatial connectivity graph over the input points.
region_core – The set of core points of the region.
region_border – The set of border points of the region.
nodes_to_expand – The set of points to be added.
params – The configuration parameters.
types – The set of distinct point types.
- Returns
The expanded region and its score.
- loci.mbrs.expand_region_with_neighbors(G, region)[source]
Expands a given region with its neighboring nodes according to the graph.
- Parameters
G – The spatial connectivity graph over the input points.
region – The set of points currently in the region.
- Returns
The expanded region.
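On an adjacency-dictionary view of the graph, this expansion is a one-step union of the region with the neighbors of its members. The sketch below assumes that representation; the loci function works on its own graph object, and expand_with_neighbors is a hypothetical name.

```python
def expand_with_neighbors(graph, region):
    """Return the region plus every node adjacent to one of its members."""
    expanded = set(region)
    for node in region:
        expanded |= graph[node]
    return expanded

# Hypothetical path graph 0 - 1 - 2 - 3.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
expanded = expand_with_neighbors(graph, {1})
```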
- loci.mbrs.get_region_score(G, types, region, params)[source]
Computes the score of the given region according to the connectivity graph.
- Parameters
G – The spatial connectivity graph over the input points.
types – The set of distinct point types.
region – The set of points in the region.
params – The configuration parameters.
- Returns
The score of the region, its relative entropy, and a vector with the values of the POI type distribution.
- loci.mbrs.get_types(gdf)[source]
Extracts the types of points and assigns a random color to each type.
- Parameters
gdf – A GeoDataFrame containing the input points.
- Returns
Set of types and corresponding colors.
- loci.mbrs.init_queue(G, seeds, types, params, topk_regions, start_time, updates)[source]
Initializes the priority queue used for exploration.
- Parameters
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
types – The set of distinct point types.
params – The configuration parameters.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
A priority queue to drive the expansion process.
- loci.mbrs.partition_data_in_grid(gdf, cell_size)[source]
Partitions a GeoDataFrame of points into a uniform grid of square cells.
- Parameters
gdf – A GeoDataFrame containing the input points.
cell_size – The size of the square cell (same units as the coordinates in the input data).
- Returns
An R-tree index over the input points; also, a GeoDataFrame representing the centroids of the non-empty cells of the grid.
- loci.mbrs.pick_seeds(gdf, seeds_ratio)[source]
Selects seed points to be used by the ExpCircles algorithm.
- Parameters
gdf – A GeoDataFrame containing the input points.
seeds_ratio – Percentage of points to be used as seeds.
- Returns
Set of seed points.
- loci.mbrs.process_queue(G, queue, topk_regions, params, types, start_time, updates)[source]
Selects and expands the next region in the queue.
- Parameters
G – The spatial connectivity graph over the input points.
queue – A priority queue of candidate regions.
topk_regions – A list to hold the top-k results.
params – The configuration parameters.
types – The set of distinct point types.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The new state after the expansion.
- loci.mbrs.run(gdf, G, rtree, types, params, eps)[source]
Computes the top-k high/low mixture regions.
- Parameters
gdf – A GeoDataFrame containing the input points.
G – The spatial connectivity graph over the input points.
rtree – The R-tree index constructed over the input points.
types – The set of distinct point types.
params – The configuration parameters.
eps – The distance threshold.
- Returns
The list of top-k regions detected within the given time budget.
- loci.mbrs.run_exp_circles(gdf, rtree, G, seeds, params, eps, types, topk_regions, start_time, updates)[source]
Executes the ExpCircles algorithm. Employs a priority queue of seeds and expands the search in circles of increasing radii around each seed.
- Parameters
gdf – A GeoDataFrame containing the input points.
rtree – The R-tree index constructed over the input points.
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
params – The configuration parameters.
eps – The distance threshold.
types – The set of distinct point types.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The list of top-k regions found within the given time budget.
- loci.mbrs.run_exp_hybrid(G, seeds, params, types, topk_regions, start_time, updates)[source]
Executes the ExpHybrid algorithm.
- Parameters
G – The spatial connectivity graph over the input points.
seeds – The set of seeds to be used.
params – The configuration parameters.
types – The set of distinct point types.
topk_regions – A list to hold the top-k results.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The list of top-k regions found within the given time budget.
- loci.mbrs.show_map(gdf, region, colors)[source]
Draws the points belonging to a single region on the map. Each point is rendered with a color based on its type.
- Parameters
gdf – A GeoDataFrame containing the input points.
region – The region to be displayed, i.e., a list of the identifiers of its constituent points.
colors – A list containing the color corresponding to each type.
- Returns
A map displaying the points of the given region.
- loci.mbrs.show_map_topk_convex_regions(gdf, colors, topk_regions)[source]
Draws the convex hull around the points per region on the map. Each point is rendered with a color based on its type.
- Parameters
gdf – A GeoDataFrame containing the input points.
colors – A list containing the color corresponding to each type.
topk_regions – The list of top-k regions to be displayed.
- Returns
A map displaying the top-k regions.
- loci.mbrs.show_map_topk_grid_regions(gdf, prtree, colors, gdf_grid, cell_size, topk_regions)[source]
Draws the points per grid-based region on the map. Each point is rendered with a color based on its type.
- Parameters
gdf – A GeoDataFrame containing the input points.
prtree – The R-tree index already constructed over the input points.
colors – A list containing the color corresponding to each type.
gdf_grid – The grid partitioning (cell centroids with their POI types) created over the input points.
cell_size – The size of the square cell in the applied grid partitioning (user-specified distance threshold eps).
topk_regions – The list of top-k grid-based regions to be displayed.
- Returns
A map displaying the top-k regions along with the grid cells constituting each region.
- loci.mbrs.update_topk_list(topk_regions, region_core, region_border, rel_se, score, init, params, start_time, updates)[source]
Checks and updates the list of top-k regions with a candidate region.
- Parameters
topk_regions – The current list of top-k best regions.
region_core – The set of core points of the candidate region.
region_border – The set of border points of the candidate region.
rel_se – The relative entropy of the candidate region.
score – The score of the candidate region.
init – A vector containing the values of the type distribution of points in the candidate region.
params – The configuration parameters.
start_time – The starting time of the execution.
updates – A structure to hold update times of new results.
- Returns
The updated list of the top-k best regions.
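One common way to maintain a top-k list is a min-heap keyed by score, so the weakest kept entry is always at the front. The sketch below shows that general technique only. The loci function takes more inputs (core and border points, relative entropy, timing) and may apply criteria not shown here; update_topk and its arguments are hypothetical.

```python
import heapq

def update_topk(topk, candidate, score, k):
    """Keep the k highest-scoring candidates in a min-heap of (score, candidate)."""
    entry = (score, candidate)
    if len(topk) < k:
        heapq.heappush(topk, entry)
    elif score > topk[0][0]:
        # The candidate beats the weakest kept entry, so it takes its place.
        heapq.heapreplace(topk, entry)
    return topk

# Hypothetical candidates: tuples of point ids with made-up scores.
topk = []
for region, score in [((1, 2), 0.4), ((3,), 0.9), ((4, 5, 6), 0.7), ((7,), 0.2)]:
    update_topk(topk, region, score, k=2)
best = sorted(topk, reverse=True)
```

Regions are stored as tuples so that entries with equal scores can still be ordered by the heap.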
loci.plots module
- loci.plots.barchart(data, orientation='Vertical', x_axis_label='', y_axis_label='', plot_title='', bar_width=0.5, plot_width=15, plot_height=5, top_k=10)[source]
Plots a bar chart with the given data.
- Parameters
data (dict) – The data to plot.
orientation (string) – The orientation of the bars in the plot (Vertical or Horizontal; default: Vertical).
x_axis_label (string) – Label of x axis.
y_axis_label (string) – Label of y axis.
plot_title (string) – Title of the plot.
bar_width (scalar) – The width of the bars (default: 0.5).
plot_width (scalar) – The width of the plot (default: 15).
plot_height (scalar) – The height of the plot (default: 5).
top_k (integer) – Top k results (if -1, show all; default: 10).
- Returns
A Matplotlib plot displaying the bar chart.
- loci.plots.heatmap(pois, sample_size=-1, kwd=None, tiles='OpenStreetMap', width='100%', height='100%', radius=10)[source]
Generates a heatmap of the input POIs.
- Parameters
pois (GeoDataFrame) – A POIs GeoDataFrame.
sample_size (int) – Sample size (default: -1; show all).
kwd (string) – A keyword to filter by (optional).
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
radius (float) – Radius of each point of the heatmap (default: 10).
- Returns
A Folium Map object displaying the heatmap generated from the POIs.
- loci.plots.map_choropleth(areas, id_field, value_field, fill_color='YlOrRd', fill_opacity=0.6, num_bins=5, tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map showing the given areas as a choropleth. Map center and zoom level are set automatically.
- Parameters
areas (GeoDataFrame) – A GeoDataFrame containing the areas to be displayed.
id_field (string) – The name of the column to use as id.
value_field (string) – The name of the column indicating the area’s value.
fill_color (string) – A string indicating a Matplotlib colormap (default: YlOrRd).
fill_opacity (float) – Opacity level (default: 0.6).
num_bins (int) – The number of bins for the threshold scale (default: 5).
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the given areas.
- loci.plots.map_cluster_contents_osm(cluster_borders, tiles='OpenStreetMap', width='100%', height='100%')[source]
Constructs a Folium Map displaying the streets and buildings, retrieved from OpenStreetMap via OSMNX, within a given AOI.
- Parameters
cluster_borders (GeoDataFrame) – The cluster polygons.
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the retrieved entities.
- loci.plots.map_cluster_diff(clusters_a, clusters_b, intersection_color='#00ff00', diff_ab_color='#0000ff', diff_ba_color='#ff0000', tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map displaying the differences between two sets of clusters. Map center and zoom level are set automatically.
- Parameters
clusters_a (GeoDataFrame) – The first set of clusters.
clusters_b (GeoDataFrame) – The second set of clusters.
intersection_color (color code) – The color to use for A & B.
diff_ab_color (color code) – The color to use for A - B.
diff_ba_color (color code) – The color to use for B - A.
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying cluster intersections and differences.
- loci.plots.map_clusters_with_topics(clusters_topics, viz_type='dominant', col_id='cluster_id', col_dominant='Dominant Topic', colormap='tab10', red='Topic0', green='Topic1', blue='Topic2', single_topic='Topic0', tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map showing the clusters colored based on their topics.
- Parameters
clusters_topics (GeoDataFrame) – A GeoDataFrame containing the clusters to be displayed and their topics.
viz_type (string) – Indicates how to assign colors based on topics. One of: ‘dominant’, ‘single’, ‘rgb’.
col_id (string) – The name of the column indicating the cluster id (default: cluster_id).
col_dominant (string) – The name of the column indicating the dominant topic (default: Dominant Topic).
colormap (string) – A string indicating a Matplotlib colormap (default: tab10).
red (string) – The name of the column indicating the topic to assign to red (default: Topic0).
green (string) – The name of the column indicating the topic to assign to green (default: Topic1).
blue (string) – The name of the column indicating the topic to assign to blue (default: Topic2).
single_topic (string) – The name of the column indicating the topic to use (default: Topic0).
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the given clusters colored by their topics.
- loci.plots.map_geometries(gdf, tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map displaying the provided geometries. Map center and zoom level are set automatically.
- Parameters
gdf (GeoDataFrame) – A GeoDataFrame containing the geometries to be displayed.
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the given geometries.
- loci.plots.map_geometry(geom, tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map displaying the provided geometry. Map center and zoom level are set automatically.
- Parameters
geom (Shapely Geometry) – A geometry to be displayed.
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the given geometry.
- loci.plots.map_points(pois, sample_size=-1, kwd=None, show_bbox=False, tiles='OpenStreetMap', width='100%', height='100%')[source]
Returns a Folium Map displaying the provided points. Map center and zoom level are set automatically.
- Parameters
pois (GeoDataFrame) – A GeoDataFrame containing the POIs to be displayed.
sample_size (int) – Sample size (default: -1; show all).
kwd (string) – A keyword to filter by (optional).
show_bbox (bool) – Whether to show the bounding box of the GeoDataFrame (default: False).
tiles (string) – The tiles to use for the map (default: OpenStreetMap).
width (integer or percentage) – Width of the map in pixels or percentage (default: 100%).
height (integer or percentage) – Height of the map in pixels or percentage (default: 100%).
- Returns
A Folium Map object displaying the given POIs.
- loci.plots.plot_wordcloud(pois, bg_color='black', width=400, height=200)[source]
Generates and plots a word cloud from the keywords of the given POIs.
- Parameters
pois (GeoDataFrame) – The POIs from which the keywords will be used to generate the word cloud.
bg_color (string) – The background color to use for the plot (default: black).
width (int) – The width of the plot.
height (int) – The height of the plot.
loci.set_evolution module
- class loci.set_evolution.Change_Detector[source]
Bases: object
Change Detector class for studying evolving sets.
- Parameters
similarities (dict) – map for caching group similarities, in the form of (group1, group2) -> sim
groups (dict) – 3d map, in the form of snapshot->group->member
matchings (dict) – map for caching snapshot similarities, in the form of (snapshot1, snapshot2) -> (group1, group2, sim)
inv_index (dict) – 3d map, in the form of member->snapshot->group
- get_group_evolution(snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]
Calculates the evolution of a specific group and returns how its members have been distributed into another snapshot. The status of the group is also labeled with a number (0: Similar, 1: Split, 2: Diffused), which is determined by how many groups contain the fraction tau and whether that number lies between alpha and beta.
- Parameters
snap1 (str) – ID of 1st snapshot.
group1 (str) – ID of 1st group.
snap2 (str) – ID of 2nd snapshot.
tau (float, optional) – Percentage collected, defaults to 0.8
alpha (int, optional) – Lower Bound for decision, defaults to 1
beta (int, optional) – Upper Bound for decision, defaults to 3
- Raises
ValueError – ID not in IDs.
- Returns
(status, related_groups): the status (0: Similar, 1: Split, 2: Diffused) and the related groups with the percentage in each group.
- Return type
tuple
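One possible reading of the tau, alpha and beta rule is sketched below: sort the groups of the second snapshot by the share of members they received, count how many are needed to collect at least tau, then compare that count with alpha and beta. The exact boundary handling (<= versus <) and the function name group_evolution are assumptions of this sketch, not confirmed behavior of the class.

```python
SIMILAR, SPLIT, DIFFUSED = 0, 1, 2

def group_evolution(members, next_groups, tau=0.8, alpha=1, beta=3):
    """members: set of member ids; next_groups: {group id: set of member ids}."""
    # Share of the original members that ended up in each later group.
    shares = {
        g: len(members & m) / len(members)
        for g, m in next_groups.items() if members & m
    }
    collected, needed = 0.0, 0
    for share in sorted(shares.values(), reverse=True):
        if collected >= tau:
            break
        collected += share
        needed += 1
    # Assumed rule: few groups needed = Similar, a handful = Split, many = Diffused.
    if needed <= alpha:
        status = SIMILAR
    elif needed <= beta:
        status = SPLIT
    else:
        status = DIFFUSED
    return status, shares

# Hypothetical group of ten members and two possible later snapshots.
members = set(range(1, 11))
status_a, shares_a = group_evolution(members, {"a": {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}})
status_b, shares_b = group_evolution(
    members, {"a": {1, 2, 3, 4, 5}, "b": {6, 7, 8, 9}, "c": {10}}
)
```

In the first case a single group holds 90% of the members, so the status is 0 (Similar). In the second, two groups are needed to pass 80%, which gives 1 (Split) under the defaults.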
- get_group_provenance(snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]
Calculates the provenance of a specific group and returns how its members originated from another snapshot. The status of the group is also labeled with a number (0: Similar, 1: Merged, 2: Accumulated), which is determined by how many groups contain the fraction tau and whether that number lies between alpha and beta.
- Parameters
snap1 (str) – ID of 1st snapshot.
group1 (str) – ID of 1st group.
snap2 (str) – ID of 2nd snapshot.
tau (float, optional) – Percentage collected, defaults to 0.8
alpha (int, optional) – Lower Bound for decision, defaults to 1
beta (int, optional) – Upper Bound for decision, defaults to 3
- Raises
ValueError – ID not in IDs.
- Returns
(status, related_groups): the status (0: Similar, 1: Merged, 2: Accumulated) and the related groups with the percentage in each group.
- Return type
tuple
- get_group_similarity(snap1, group1, snap2, group2)[source]
Calculate Similarity between 2 groups (Jaccard).
- Parameters
snap1 (str) – ID of 1st snapshot.
group1 (str) – ID of 1st group.
snap2 (str) – ID of 2nd snapshot.
group2 (str) – ID of 2nd group.
- Raises
ValueError – ID not in list of IDs.
- Returns
Jaccard similarity of the two groups.
- Return type
int
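Example (illustrative sketch). The Jaccard similarity of two groups is the size of their intersection divided by the size of their union. The helper below is a standalone illustration; the name jaccard and the value returned for two empty groups are assumptions, not part of the library.

```python
def jaccard(group_a, group_b):
    """Jaccard similarity of two collections of member ids."""
    a, b = set(group_a), set(group_b)
    if not a and not b:
        # Assumption: two empty groups are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two members in common out of four distinct members overall.
similarity = jaccard(["a", "b", "c"], ["b", "c", "d"])
```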
- get_groups(snap)[source]
Return the ids of groups inside a specific snapshot.
- Parameters
snap (str) – ID of snapshot.
- Raises
ValueError – ID not in snapshot IDs.
- Returns
List of ids of groups.
- Return type
list
- get_groups_of_member(member)[source]
Return the ids of (snapshot,group) that a member belongs to.
- Parameters
member (str) – ID of member.
- Raises
ValueError – Member ID not in IDs.
- Returns
List of ids of (snapshot,group).
- Return type
list
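Example (illustrative sketch). A lookup of this kind is what an inverted index of the form member->snapshot->group makes cheap. The sketch below builds such an index from a nested dictionary and queries it. The input layout (snapshot -> group -> members), the helper names and the assumption that a member belongs to at most one group per snapshot are all made up for illustration.

```python
def build_inv_index(snapshots):
    """Build a member -> snapshot -> group map from snapshot -> group -> members."""
    inv_index = {}
    for snap, groups in snapshots.items():
        for group, members in groups.items():
            for member in members:
                inv_index.setdefault(member, {})[snap] = group
    return inv_index


def groups_of_member(inv_index, member):
    """Return the (snapshot, group) pairs the member belongs to."""
    if member not in inv_index:
        raise ValueError(f"Member ID {member!r} not in IDs.")
    return list(inv_index[member].items())


# Hypothetical data: two snapshots with two groups each.
snapshots = {
    "s1": {"g1": ["a", "b"], "g2": ["c"]},
    "s2": {"g1": ["a", "c"], "g2": ["b"]},
}
inv_index = build_inv_index(snapshots)
```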
- get_member_comembers(member, snap)[source]
Return the ids of members in the same group as member in the specific snapshot.
- Parameters
member (str) – ID of member.
snap (str) – ID of snapshot.
- Raises
ValueError – ID not in IDs.
- Returns
List of ids of members.
- Return type
list
- get_member_evolution(member)[source]
Calculate Member Evolution and return the similarity scores between the groups that the member belongs to in subsequent snapshots.
- Parameters
member (str) – ID of member.
- Raises
ValueError – Member ID not in member IDs.
- Returns
Evolution of member.
- Return type
list
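Example (illustrative sketch). One way to picture member evolution is to follow a member through an ordered list of snapshots and compare the group it is in with the group it was in one snapshot earlier. The sketch assumes Jaccard similarity as the score, one group per member per snapshot, and an explicit order argument; the helper name and data layout are hypothetical.

```python
def member_evolution(snapshots, order, member):
    """Similarity between the member's groups in consecutive snapshots."""

    def group_of(snap):
        # Find the group of this snapshot that contains the member, if any.
        for members in snapshots[snap].values():
            if member in members:
                return set(members)
        return None

    scores = []
    for prev, curr in zip(order, order[1:]):
        g_prev, g_curr = group_of(prev), group_of(curr)
        if g_prev is None or g_curr is None:
            # Assumption: no score when the member is missing from a snapshot.
            scores.append(None)
        else:
            scores.append(len(g_prev & g_curr) / len(g_prev | g_curr))
    return scores


# Hypothetical data: member "c" leaves the group of "a" in the second snapshot.
snapshots = {
    "s1": {"g1": ["a", "b", "c"]},
    "s2": {"g1": ["a", "b"], "g2": ["c"]},
    "s3": {"g1": ["a", "b"]},
}
scores = member_evolution(snapshots, ["s1", "s2", "s3"], "a")
```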
- get_member_rules(member, min_support=0.07, min_threshold=1, metric='lift')[source]
Return rules from frequent subgroups mining for specific member.
- Parameters
member (list, optional) – Member ID to use for group filtering.
min_support (float, optional) – A float between 0 and 1 for minimum support of the itemsets returned, defaults to 0.07
min_threshold (float, optional) – Minimal threshold for the evaluation metric, via the metric parameter, to decide whether a candidate rule is of interest, defaults to 1
metric (str, optional) – Metric to use: ‘lift’, ‘support’, ‘confidence’, ‘leverage’, ‘conviction’, defaults to ‘lift’
- Returns
pandas DataFrame with columns “antecedents” and “consequents” that store itemsets, plus the scoring metric columns: “antecedent support”, “consequent support”, “support”, “confidence”, “lift”, “leverage”, “conviction” of all rules for which metric(rule) >= min_threshold. Each entry in the “antecedents” and “consequents” columns is of type frozenset, which is a Python built-in type that behaves similarly to sets except that it is immutable (for more info, see https://docs.python.org/3.6/library/stdtypes.html#frozenset).
- Return type
DataFrame
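Example (illustrative sketch). The scoring columns listed above follow the usual association-rule definitions. The sketch below computes them for a single rule, treating each group as a transaction and each member as an item; that mapping, the helper name rule_metrics and the sample data are assumptions. It also assumes the antecedent occurs in at least one group.

```python
def rule_metrics(groups, antecedent, consequent):
    """Standard association-rule metrics for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(groups)
    # Count the groups that contain each itemset.
    count_a = sum(antecedent <= set(g) for g in groups)
    count_c = sum(consequent <= set(g) for g in groups)
    count_ac = sum((antecedent | consequent) <= set(g) for g in groups)

    sup_a, sup_c, support = count_a / n, count_c / n, count_ac / n
    confidence = count_ac / count_a
    return {
        "antecedent support": sup_a,
        "consequent support": sup_c,
        "support": support,
        "confidence": confidence,
        "lift": confidence / sup_c,
        "leverage": support - sup_a * sup_c,
        # Conviction is unbounded when the rule never fails.
        "conviction": float("inf") if confidence == 1 else (1 - sup_c) / (1 - confidence),
    }


# Hypothetical groups of member ids.
groups = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
metrics = rule_metrics(groups, {"a"}, {"b"})
```

A lift below 1, as in this sample, means the two members appear together less often than they would if they were independent, so the default min_threshold=1 on lift would filter this rule out.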
- get_members(snap, group)[source]
Return the ids of members inside a specific group of a specific snapshot.
- Parameters
snap (str) – ID of snapshot.
group (str) – ID of group.
- Raises
ValueError – ID not in IDs.
- Returns
List of ids of members.
- Return type
list
- get_snapshot_evolution()[source]
Calculate Snapshot Evolution and return the similarity scores between subsequent snapshots.
- Returns
Evolution of snapshot sequence.
- Return type
list
- get_snapshot_similarity(snap1, snap2, groups=False)[source]
Calculate Snapshot Similarity between 2 snapshots and return the matching groups with their corresponding similarity scores.
- Parameters
snap1 (str) – ID of 1st snapshot
snap2 (str) – ID of 2nd snapshot
groups (bool, optional) – Whether to return the groups of the matching, defaults to False
- Raises
ValueError – If snapshot ID not in snapshot IDs.
- Returns
If groups=False, returns similarity of two snapshots. If groups=True, returns (sim, matching), i.e. detailed similarities of matching.
- Return type
tuple
- get_snapshots()[source]
Return the ids of snapshots.
- Returns
List of ids of snapshots.
- Return type
list
- set_data(data, type, file=False)[source]
Pass data to initialize Change Detector. Only csv or json are allowed.
- Parameters
data (str) – data given or filename to find data.
type (str) – type of data given, “json” or “csv”.
file (bool, optional) – whether a file is given as input, in which case data holds the filename; defaults to False
- Returns
None
- Return type
None
- class loci.set_evolution.Graph[source]
Bases: object
Class for visualizing Change Detector methods.
- group_evolution(cd, snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]
Create graph content for Group Evolution for given arguments.
- Parameters
cd (Change_Detector) – Change_Detector object.
snap1 (str) – ID of Snapshot1.
group1 (str) – ID of Group1.
snap2 (str) – ID of Snapshot2.
tau (float) – Percentage collected, defaults to 0.8.
alpha (int) – Lower Bound for decision, defaults to 1.
beta (int) – Upper Bound for decision, defaults to 3.
- Returns
None
- Return type
None
- group_provenance(cd, snap1, group1, snap2, tau=0.8, alpha=1, beta=3)[source]
Create graph content for Group Provenance for given arguments.
- Parameters
cd (Change_Detector) – Change_Detector object.
snap1 (str) – ID of Snapshot1.
group1 (str) – ID of Group1.
snap2 (str) – ID of Snapshot2.
tau (float) – Percentage collected, defaults to 0.8.
alpha (int) – Lower Bound for decision, defaults to 1.
beta (int) – Upper Bound for decision, defaults to 3.
- Returns
None
- Return type
None
- group_similarity(cd, snap1, group1, snap2, group2)[source]
Create graph content for Group Similarity for given arguments.
- Parameters
cd (Change_Detector) – Change_Detector object.
snap1 (str) – ID of Snapshot1.
group1 (str) – ID of Group1.
snap2 (str) – ID of Snapshot2.
group2 (str) – ID of Group2.
- Returns
None
- Return type
None
- init_graph(cd)[source]
Create graph content.
- Parameters
cd (Change_Detector) – Change_Detector object.
- Returns
None
- Return type
None
- member_evolution(cd, member)[source]
Create graph content for Member Evolution for given member.
- Parameters
cd (Change_Detector) – Change_Detector object.
member (str) – ID of Member.
- Returns
None
- Return type
None
loci.time_series module
- loci.time_series.change_detection_collective(ts_df_array, filenames, model, min_size, eps, min_samples, date_column, data_column)[source]
This method identifies the change points within a collection of time series, ranks them and classifies them as global or local changes. To find the change points, the implementation uses the PELT approach (from the ruptures library) and calculates the rate change metric of each one. It then uses DBSCAN to form clusters, each of which groups the change points that are part of the same global change.
- Parameters
ts_df_array (list) – A list of Pandas dataframes containing the loaded time series.
filenames (list) – The corresponding filename of each time series.
model (string) – The desired PELT model (one of “l1”, “l2”, “normal”, “rbf”, “linear”, or “ar”).
min_size (int) – Minimum number of samples between two change points (ruptures).
eps (int) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster (DBSCAN).
min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (DBSCAN).
date_column (int) – The column number containing the datetime of each entry in each file.
data_column (int) – The column number containing the values in each file.
- Returns
final_data (dict) – A dictionary containing the timestamps/dates of the identified change points, the name or id of the corresponding time series, the rate change of the change points and the local-global cluster label (-1 stands for local changes).
data_clusters (dict) – A dictionary containing all the identified clusters (global changes), their aggregate and absolute aggregate rate changes and corresponding cluster scores, their starting and ending date or timestamp and the number of members of each cluster.
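Example (illustrative sketch). To show how density-based clustering separates global from local changes, the sketch below runs a small pure-Python DBSCAN over change-point positions expressed as day offsets. It is a stand-in written for illustration, not the clustering code used by the library; the helper name dbscan_1d and the sample positions are assumptions. As in the parameter descriptions above, a point counts itself as a neighbor, and the label -1 marks points that belong to no cluster.

```python
def dbscan_1d(points, eps, min_samples):
    """Minimal DBSCAN over 1-D values; returns one label per point (-1 = noise)."""
    n = len(points)
    # Neighborhood of each point, including the point itself.
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Hypothetical change points (day offsets) gathered from several time series.
change_days = [10, 11, 12, 40, 70, 71]
labels = dbscan_1d(change_days, eps=2, min_samples=2)
```

Here the change points around days 10 to 12 and 70 to 71 form two clusters (global changes), while the isolated change at day 40 is labeled -1 (a local change).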
- loci.time_series.change_setection_single(ts_df, model, min_size)[source]
This method identifies the change points within a single time series using the PELT approach (i.e., from the ruptures library: https://centre-borelli.github.io/ruptures-docs/user-guide/detection/pelt/).
- Parameters
ts_df (pandas.DataFrame) – A Pandas dataframe containing the loaded time series
model (string) – The desired PELT model (one of “l1”, “l2”, “normal”, “rbf”, “linear”, or “ar”).
min_size (int) – The minimum distance (in number of timestamps) between two consecutive change points.
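Example (illustrative sketch). PELT searches for the segmentation that minimizes the sum of per-segment costs plus a penalty per change point, pruning segment starts that can no longer be optimal. The sketch below is a compact pure-Python version with a least-squares (“l2”) cost, written only to illustrate the idea; it is not the ruptures implementation. The helper name pelt_l2 and the pen argument are assumptions, and the min_size constraint is left out for brevity.

```python
def pelt_l2(signal, pen):
    """Hypothetical helper: PELT with a least-squares segment cost.

    Returns the indices at which a new segment starts.
    """
    n = len(signal)
    # Prefix sums give the cost of any segment in constant time.
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        # Sum of squared deviations from the mean of signal[a:b].
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [-pen] + [0.0] * n  # best[t]: optimal penalized cost of signal[:t]
    prev = [0] * (n + 1)       # prev[t]: start of the last segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + pen, s) for s in candidates]
        best[t], prev[t] = min(scored)
        # Pruning: keep only the starts that can still be optimal later on.
        candidates = [s for value, s in scored if value - pen <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts.
    change_points, t = [], n
    while prev[t] > 0:
        t = prev[t]
        change_points.append(t)
    return sorted(change_points)


# Hypothetical signal with a single level shift at index 10.
signal = [0.0] * 10 + [10.0] * 10
change_points = pelt_l2(signal, pen=5.0)
```

A larger penalty yields fewer change points; a smaller one makes the search more sensitive to small shifts.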
- loci.time_series.create_sankey(df_array, alphabet_size, word_size, begin, end, from_value, to_value)[source]
This method generates and returns an interactive, Plotly-based SankeyTS diagram. The bands and flows of the diagram are generated based on the SAX words of the loaded set of time series.
- Parameters
df_array (list) – A list containing a Pandas dataframe for each read time series.
alphabet_size (int) – The alphabet size for SAX encoding.
word_size (int) – The SAX word length.
begin (datetime) – The starting date for the SankeyTS diagram.
end (datetime) – The ending date for the SankeyTS diagram.
from_value (float) – The starting value for the SankeyTS diagram.
to_value (float) – The ending value for the SankeyTS diagram.
- Returns
A SankeyTS diagram.
- Return type
plotly.graph_objects.Figure
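Example (illustrative sketch). SAX turns a time series into a short word: the series is z-normalized, averaged over word_size equal frames, and each frame mean is mapped to a letter using breakpoints that split the standard normal distribution into alphabet_size equally probable regions. The sketch below follows that textbook procedure with the standard library only; the helper name sax_word is an assumption, and it assumes the series length is a multiple of word_size.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_size, alphabet_size):
    """Encode a numeric series as a SAX word of length word_size."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series maps to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Piecewise aggregate approximation: mean of each equal-length frame.
    frame = len(z) // word_size
    paa = [mean(z[i * frame:(i + 1) * frame]) for i in range(word_size)]

    # Breakpoints cut the standard normal into equiprobable regions.
    breakpoints = [
        NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)
    ]
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[sum(v > b for b in breakpoints)] for v in paa)


# Hypothetical series that rises in three steps.
word = sax_word([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], word_size=3, alphabet_size=3)
```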
- loci.time_series.read_file(my_file)[source]
Reads a single time series from the given file.
- Parameters
my_file (string) – The file containing the time series.
- Returns
A Pandas dataframe containing the read time series.
- Return type
pandas.DataFrame
- loci.time_series.read_files(my_path, date_column, data_column, date_format='%m/%d/%Y')[source]
Reads all the co-evolving time series files contained in the given path, each into a Pandas dataframe.
- Parameters
my_path (string) – The path containing the time series files.
date_column (int) – The column number containing the datetime of each entry in each file.
data_column (int) – The column number containing the values in each file.
date_format (string) – The format of the date.
- Returns
df_array (list) – A list containing a Pandas dataframe for each read time series.
filenames (list) – A list containing all read filenames.
start_date (datetime) – The starting datetime of all time series.
end_date (datetime) – The ending datetime of all time series.
- loci.time_series.seasonal_decomposition(ts_df, date_column, data_column, periods, m)[source]
This method performs the Triple Time Series Decomposition. The service takes as input a time series, the corresponding model type (“Multiplicative” or “Additive”), a list of period parameters and the corresponding locale, if applicable. The user can either supply the path of the selected data or provide the data in array form. If the user provides more than one period parameter, the method selects the best one according to the gain index. The provided time series is decomposed into three distinct components according to the selected model and period:
Trend: the increasing or decreasing value in the series.
Seasonality: the repeating short-term cycle in the series.
Residual Error: the random variation in the series.
An additive model suggests that the components are added together as follows:
y(t) = Trend + Seasonality + Residual Error
A multiplicative model suggests that the components are multiplied together as follows:
y(t) = Trend * Seasonality * Residual Error
This implementation uses the “statsmodels.tsa.seasonal.seasonal_decompose” from the statsmodels library.
- Parameters
ts_df (pandas.DataFrame) – A Pandas dataframe containing the loaded time series.
date_column (int) – The column number containing the datetime of each entry in each file.
data_column (int) – The column number containing the values in each file.
periods (list) – A list containing the periods to be tested.
m (string) – The type of the model, ‘additive’ or ‘multiplicative’.
- Returns
result (statsmodels.tsa.seasonal.seasonal_decompose) – The result of the seasonal decomposition.
best_period (json) – The selected best period based on the minimum mean absolute value of the residual error component.
gain_indexes (json) – A json with the {period:gain_index} for all the tested periods.
fig1 (plotly.express) – A Plotly figure containing the trend, seasonality and residual error components.
fig2 (plotly.express) – A Plotly figure containing the seasonality component.
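Example (illustrative sketch). The sketch below shows a naive additive decomposition and a period selection based on the minimum mean absolute residual, the criterion mentioned for best_period. It is a simplified stand-in for statsmodels' seasonal_decompose, not the library's code: the helper names are hypothetical, only odd periods are supported (so that the centered moving average is symmetric), and the series must hold at least 2 * period - 1 values.

```python
from statistics import mean


def additive_decompose(series, period):
    """Naive additive decomposition: y(t) = trend + seasonal + residual."""
    n, half = len(series), period // 2
    # Trend: centered moving average over one full period (odd periods only).
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(series[t - half:t + half + 1])

    # Seasonality: average detrended value per phase, centered around zero.
    detrended = {t: series[t] - trend[t] for t in range(half, n - half)}
    phase_means = [
        mean(v for t, v in detrended.items() if t % period == p)
        for p in range(period)
    ]
    offset = mean(phase_means)
    seasonal = [phase_means[t % period] - offset for t in range(n)]

    # Residual: what is left after removing trend and seasonality.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


def best_period(series, periods):
    """Pick the period whose residual has the smallest mean absolute value."""
    scores = {}
    for period in periods:
        _, _, residual = additive_decompose(series, period)
        scores[period] = mean(abs(r) for r in residual if r is not None)
    return min(scores, key=scores.get), scores


# Hypothetical series: a linear trend plus a cycle of length 3.
pattern = [-1.0, 0.0, 1.0]
series = [float(t) + pattern[t % 3] for t in range(30)]
best, scores = best_period(series, [3, 5])
```

For this synthetic series the period of 3 leaves no residual, so it is preferred over 5. A multiplicative variant would divide by the trend and seasonality instead of subtracting them.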
loci.topics module
- loci.topics.topic_modeling(clusters, label_col='cluster_id', kwds_col='kwds', num_of_topics=3, kwds_per_topic=10)[source]
Models clusters as documents, extracts topics, and assigns topics to clusters.
- Parameters
clusters (GeoDataFrame) – A POI GeoDataFrame with assigned cluster labels.
label_col (string) – The name of the column containing the cluster labels (default: cluster_id).
kwds_col (string) – The name of the column containing the keywords of each POI (default: kwds).
num_of_topics (int) – The number of topics to extract (default: 3).
kwds_per_topic (int) – The number of keywords to return per topic (default: 10).
- Returns
A DataFrame containing the clusters-to-topics assignments and a DataFrame containing the topics-to-keywords assignments.
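Example (illustrative sketch). "Modeling clusters as documents" can be pictured as pooling the keywords of all POIs that share a cluster label into one bag of words per cluster, which a topic model can then consume. The sketch below shows only that pooling step, using plain dictionaries as a stand-in for GeoDataFrame rows; the helper name clusters_to_documents and the sample rows are assumptions, and the topic extraction itself is not shown.

```python
from collections import Counter, defaultdict


def clusters_to_documents(rows, label_col="cluster_id", kwds_col="kwds"):
    """Pool the keywords of each cluster into one bag-of-words document."""
    documents = defaultdict(Counter)
    for row in rows:
        # Each row is one POI with a cluster label and a list of keywords.
        documents[row[label_col]].update(row[kwds_col])
    return dict(documents)


# Hypothetical POIs with assigned cluster labels.
rows = [
    {"cluster_id": 0, "kwds": ["cafe", "bar"]},
    {"cluster_id": 0, "kwds": ["cafe"]},
    {"cluster_id": 1, "kwds": ["museum", "gallery"]},
]
documents = clusters_to_documents(rows)
```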