A Range Query Scheme for Spatial Data with Shuffled Differential Privacy (2024)

1. Introduction

With the development of the era of big data, massive amounts of spatial data are generated, inwhich the user’s sensitive location information may be leaked during the collection and publication by third-party servers. Forexample, when the server collects and publishes the user’s location information, it may be attacked by an attacker or learn the user’s entertainment, shopping, social and other daily behavior patterns such that the user’s personal privacy is leaked. Recently, theTelegram robot was suspected to have leaked 4.5 billion pieces of express information, including the user’s name, telephone and home address [1]. Therefore, it is necessary to protect the user’s spatial data privacy when publishing spatialdata.

Differential privacy [2] protection was proposed by Dworketal. in 2006 to protect data privacy. Central differential privacy [3] adds noise to the statistical results of the data. This assumes that the third-party servers are fully trusted and have high availability, butit* privacy protection is low compared with other technologies. Inpractical applications, it is difficult to find fully trusted third-party servers, and thus, local differential privacy [4] came into being. Users send data to the third-party servers after local perturbation. Compared with the central differential privacy, thedegree of privacy protection is higher. Theshuffled differential privacy [5] is between the two privacy protections. Theintroduction of a shuffler shuffles the correspondence between the data and user ID, which can improve the data availability with the same intensity of privacy protection as local differentialprivacy.

For high-dimensional spatial data, this mainly refers to the dimension greater than two, including user location information, attribute data, relationship information and time update information. Similar to relational data [6], thehigh-dimensional spatial data will also face the same “curse of dimensionality” [7], that is, theincrease in the communication cost due to the distribution and scale of data points and the reduction in signal-to-noise ratio bring about the reduction in data availability. Therefore, it is necessary to construct an efficient index structure for dimensionality reduction. A spatial index structure, such as a UB-tree, is a combination of a Z-order and $B^{*}$ data index structure proposed by Bayeretal. [8]. Lietal. [9] proposed an adaptively clocking region division to resist the threat under a location-based service. However, theactual spatial location data will have different distribution densities. Different from the previous work under the B-tree, Wangetal. [10] proposed a new domain decomposition mechanism of privNUD with local differential privacy, butit performs an adaptive range query on one-dimensionaldata.

At present, thework of range queries with differential privacy protection mainly focuses on two-dimensional or one-dimensional spatial data. Cormodeetal. [11] proposed private spatial decompositions (PSDs) based on central differential privacy, which divides spatial data, reports the statistical number of points in the region andthen answers the range query. Zhangetal. [12] proposed PrivTree, which uses the quadtree for hierarchical decomposition and introduces a bias to weaken the influence of noise. Different from the above spatial range query work with central differential privacy, Kimetal. [13] used local differential privacy in indoor positioning systems for the first time to estimate the density of the region. Kulkarnietal. [14] proposed and analyzed different methods to answer one-dimensional range queries with local differential privacy. Both of these works have the problem that the direct application of the schemes to multi-dimensional scenarios will reduce the statistical accuracy becausethe density of the data is not considered. Duetal. [15] proposed an adaptive hierarchical decomposition protocol (AHEAD), which can control the added noise by adaptively dynamically controlling the established tree structure. Ahujaetal. [16] designed a method to provide accurate queries based on variational auto-encoders. Thescheme improves the data availability for range queries in multi-dimensional scenarios, which inevitably leads to a reduction in privacy protection intensity. Theabove are the works on range queries with local or central differentialprivacy.

Compared with the previous range query schemes with local differential privacy, theresearch work of shuffled differential privacy is currently focused on frequency and mean statistics. Theshuffled differential privacy was proposed by Bittauetal. [17] in 2017, which achieved a good tradeoff between the degree of privacy protection and data availability. Balleetal. [5] proposed the real-sum protocol and privacy amplification effect in the shuffled model. Wangetal. [18] studied the shuffle model from two aspects of algorithm and model security, andproposed a more secure PEOS model. Cheuetal. [19] presented a protocol to reduce the message complexity of histograms. We compared the range queries containing the $_{d p} K D$ (KDRQ) scheme with related work by other researchers, as shown in Table 1. These works do not involve range queries oriented to spatial data, and thus, we used the index structure to perform range queries on high-dimensional spatial data with shuffled differential privacy. Thescheme was directly applied to the shuffled differential privacy model, andthe data distribution law was not retained, which led to the problem of low queryaccuracy.

With shuffled differential privacy, we designed a new index structure to improve the query accuracy andcould achieve adaptive range queries. Compared with the existing indexing schemes for multi-dimensional spatial data, we constructed an index structure that supports range queries for high-dimensional spatial data, which is more in line with the distribution characteristics of data and can support fast high-dimensional range queries. Thespecific contributions are as follows:

A secure range query scheme for high-dimensional geographic spatial data with shuffled differential privacy is proposed. It solves the problem of low query accuracy and low availability resulting from the direct extension of local differential privacy to shuffled differential privacy scenario.
Building an index that supports high-precision range queries. Theindex can adaptively divide the spatial region and improve the accuracy of range queries about spatial data. Thealgorithm can also be used to solve other range query problems with shuffled differential privacy.
Security analysis showed that our KDRQ scheme satisfied the requirements for shuffled differential privacy. Experimental analysis on the Landmark and Checkin datasets showed that the KDRQ method was superior to the existing methods in terms of range query accuracy.

The organization of this paper is as indicated below. InSection 2, we introduce the related work. InSection 3, we introduce the preliminaries used in this paper. InSection 4, we present the system model of the scheme. Then, Section 5 provides a detailed introduction to the scheme we have proposed. InSection 6, we present experimental results to show the query accuracy of KDRQ. Finally, we summarize this paper in Section 7.

2. RelatedWork

Most of the existing publishing schemes for spatial data with differential privacy build grids [20], andthen users perturb the units in the grid based on random responses. However, becausethe scheme perturbs each location in the same way, it does not provide different optimization schemes for different spatial data. Cormodeetal. [11] proposed a privacy space decomposition scheme with central differential privacy, which divides the spatial data into smaller regions, reports the statistical number of points in each region, andanswers the query. Zhangetal. [12] proposed a hierarchical decomposition differential privacy algorithm PrivTree based on quadtree to divide regions. However, when faced with non-uniform data sets, thequery results are not ideal due to the unbalanced data points. Chenetal. [21] studied the aggregation of spatial data with local differential privacy, butthey did not design an effective index structure. Wangetal. [22] proposed a segmentation mechanism to collect and publish high-dimensional spatial data, andused noise distribution to disturb points. However, due to the influence of the segmentation mechanism on data at a single location, theerror was large. This study solved the above problems anddesigned an index of high-precision range query with shuffled differential privacy to achieve adaptive rangequeries.

3. Preliminaries

3.1. RangeQuery

A range query mainly refers to the number of users that satisfy a certain range. Thefollowing parts are one-dimensional, two-dimensional, three-dimensional and even n-dimensional rangequeries.

One-dimensional data: The core idea of a range query of one-dimensional data based on differential privacy is to query the number of user data points in the range of $R Q = [a, b]$ . Figure 1 is a simplified instance of a range query of one-dimensional data. Suppose that there are three one-dimensional data points stored in the cloud server, namely, $p_{1}$ , $p_{2}$ and $p_{3}$ ; thecloud server can have a range query $[3, 4]$ andcan also have larger range queries $[2, 5]$ and $[1, 6]$ . Fora given range query $[a, b]$ , thecloud server returns the count of data points within the range $[a, b]$ .

Two-dimensional data: The core idea of a range query of two-dimensional data based on differential privacy is whether the query point q appears in $R Q = [a, b] \times [c, d]$ . Figure 2 is a simplified instance of a range query of two-dimensional data. Assume that two vectors $v_{1} ⊥ v_{2}$ are randomly generated, and according to $P r o_{v_{i}} (p) = \frac{\vec{p} * \vec{v_{i}}}{|\vec{v_{i}}|}$ , $i = 1, 2$ , calculate the projection of each point. Given the range query $R Q = [a, b] \times [c, d]$ , thecloud server determines whether the point is within $R Q$ , that is, $p_{v_{1} \in [a, b]}$ and $p_{v_{2} \in [c, d]}$ . Then, the query result of RQ can be expressedas

$R Q_{[a, b] \times [c, d]} = \sum_{i = 1}^{n} I_{a \leq x_{i} \leq b, c \leq y_{i} \leq d},$

(1)

where I is the identification function, where avalue of 1 means that the spatial location of the i-th user is within $R Q$ , anda value of 0 means that the spatial location of the user is not within $R Q$ .

Three-dimensional data and higher-dimensional data: The core idea of a range query of three-dimensional data based on differential privacy is whether the query point q appears in $R Q = [a, b] \times [c, d] \times \cdot \cdot \cdot \times [k, l]$ . Figure 3 is a simplified instance of a range query of three-dimensional data. Suppose that three vectors $v_{1} ⊥ v_{2} ⊥ v_{3}$ are randomly generated, andaccording to $P r o_{v_{i}} (p) = \frac{\vec{p} * \vec{v_{i}}}{|\vec{v_{i}}|}$ , $i = 1, 2, 3$ , calculate the projection of each point on it. Given the range query $R Q = [a, b] \times [c, d] \times [e, f]$ , thecloud server determines whether the point is within $R Q$ , that is, $p_{v_{1} \in [a, b]}$ , $p_{v_{2} \in [c, d]}$ and $p_{v_{3} \in [e, f]}$ . Similarly, forthe query $R Q$ (hypercube) of higher-dimensional data, thequery results can be expressed by a mathematical expression as follows:

$R Q_{[a, b] \times [c, d] \times \cdot \cdot \cdot \times [k, l]} = \sum_{i = 1}^{n} I_{a \leq x_{i} \leq b, c \leq y_{i} \leq d} .$

(2)

3.2. Local Differential Privacy

The data is protected by which users send the perturbed data to the third-party server with local differential privacy. The specific mathematical definition of local differential privacy is given as follows:

Definition 1

( $ϵ, δ$ )-local differential privacy). Given a user’s spatial location data and a random algorithm M, forany two spatial data $g_{i}$ and $g_{i}^{'}$ , therandom algorithm M satisfies ( $ϵ, δ$ )-local differential privacy if and only if all possible outputs y satisfy the following inequality:

$P r (M (g_{i} = y)) \leq e^{ϵ} \cdot P r (M (g_{i}^{'} = y)) + δ .$

(3)

The existing local differential privacy protocols are mainly based on the randomized aggregatable privacy-preserving ordinal response (RAPPOR) [23] to implement local perturbations. Thelocal transformation algorithm in this paper is based on our previous work [24].

3.3. Shuffled DifferentialPrivacy

Compared with central differential privacy and local differential privacy, shuffled differential privacy [17,25] consists of a local perturbation, shuffle and analysis. Thespecific mathematical definition of shuffled differential privacy is as follows:

Definition 2

( $ϵ, δ$ )-shuffled differential privacy). Given a user’s spatial location data and a randomly shuffled algorithm $S h$ and local perturbed algorithm M, forany two spatial data $g_{i}$ and $g_{i}^{'}$ , thealgorithm $S = S h \circ M$ satisfies ( $ϵ, δ$ )-shuffled differential privacy if and only if all possible shuffled outputs y satisfy the following inequality:

$P r (S (g_{i} = y)) \leq e^{ϵ} \cdot P r (S (g_{i}^{'} = y)) + δ .$

(4)

The data analysis under shuffled differential privacy is shown in Figure 4: each user perturbs their spatial data with a random oracle to obtain a report $r_{1}, r_{2}, \cdot \cdot \cdot r_{n}$ . Then, multiple shufflers shuffle the report and send it to the server. Theserver counts the query results and finally sends them to the dataanalyst.

4. SystemModel

4.1. SystemModel

As shown in Figure 5, thesystem model involves four participants: query user, data owner, shuffler and cloudserver.

Query users: They are generally the data owner themself or other legitimate users who have some completely trusted internal connection with the data owner. Thequery user hopes to search for the content that they are interested in within the entire datasets provided by all data owners andis not willing to disclose the relevant query results. Inorder to achieve this goal, theauthorized user sends the query to the cloud server through the secure channel in this scheme. Afterreceiving the query request, theserver runs the query algorithm on the index tree and returns the query result to theuser.

Data owners: The n data owners are willing to outsource their data to the cloud server in the form of differential privacy protection, while allowing authorized query users to query all data. Inthis scheme, each data owner has their own spatial data $g_{i}$ . They first locally perturb their location data and then send the message to theshuffler.

Shuffler: The shuffler is an independent semi-trusted server that performs secure shuffled operations without knowing the data. Afterreceiving the data of the user’s local disturbance, it shuffles the receiveddata.

Cloud server: The cloud server is untrusted. Thecloud server can build an empty $_{d p} K D$ tree and share the index with the data owner. Once the data released by the shuffler is received, thetree is reconstructed and the final statistical results are obtained. Finally, thequery results are returned to the query user. Inaddition, when the data owner sends the update information, thecloud server should update its stored indexsynchronously.

The cloud server in this scheme is untrusted. Fromthe user’s view, other parties in the system model, including the shuffler (auxiliary server), cloud server andother data owners, may be adversaries. We assume that all participants have the same level of background knowledge and need to consider the consequences of collusion between different parties. There are three important adversaries: the adversary is the server itself; a conspiracy between the server and other users; and a conspiracy between the server and shuffler. Inparticular, theserver can collude with the auxiliary server or other users. Inthis case, themodel is simplified to local differentialprivacy.

4.2. ProblemDefinition

Given the spatial data $g_{i}$ of n data owners, shufflers and the cloud server, each data owner (user) perturbs the data via an $L T$ algorithm and sends $L T (g_{i})$ to the shuffler. Theshuffler shuffles the received n inputs and outputs the results. Inorder to formalize it, theshuffled mechanism is defined as $S h (D) = S R R_{Q} (L T (g_{1}, \cdot \cdot \cdot, L T (g_{n})))$ . Theserver performs statistical analysis based on the shuffled data, that is, therange query algorithm is $S D R Q : R Q (v_{i}) \to \tilde{R Q}$ . Theproblem that was solved in this study was to design a spatial range query scheme that satisfies the shuffled differential privacy, obtains the query results with high accuracy as far as possible, andsupport sadaptivepartitioning.

4.3. The Design Goal of the Spatial QueryScheme

In order to achieve accurate and secure range queries over privacy-preserving cloud data by multiple data owners, our scheme should meet the following goals and security requirements:

(1): Support multiple data owners: multiple data owners can outsource their data safely and conveniently, andthe complexity of users to create queries is almost unaffected by the number of dataowners.
(2): Ideal query results: cloud servers should help users to query high-precision query results in outsourced datasets by all dataowners.
(3): Efficiency and timeliness: The cloud server receives a massive number of geospatial data queries, andthe spatial data are frequently updated. It is necessary to construct an index that supports efficient queries and updates the performance for these characteristics. This will improve the user’s query experience, butalso provide results closer toreality.
(4): Security: aiming at the range query requirements of outsourcing geographic data with shuffled differential privacy protection, we analyzed the potential attacks and the security of the scheme, which is divided into the following points:

Data confidentiality: the content of the outsourced datasets by the data owner should be protected and should not be known to the cloud server or other unauthorized queryusers.

Query privacy: The query content in the user’s query is sent to the cloud server through a secure channel andwill not be leaked to any adversary. This scheme is not considered.

5. Range Query over SpatialData

5.1. Overview

With the protection of shuffled differential privacy, we propose a hierarchical adaptive division method of spatial data to divide spatial data by using the law of data distribution (Algorithms 1). Thespecific steps are as follows:

(1): Data partitioning (constructing the $_{d p} K D$ tree): Considering sparse and dense regions, the $_{d p} K D$ tree is constructed to divide the spatial datasets. TheAlgorithms 2 and 3 are based on the $K D$ tree and quadtree to segment and index spatialdata.
(2): Shuffling data and reconstructed $_{d p} K D$ tree: Each user that is a data owner first traverses the $_{d p} K D$ tree to find the path containing their own location. Afterfinding the leaf node, theweight of the path from the leaf node to the root node is assigned to 1, andthe weights of the other paths are each assigned a 0. Then, theuser randomly samples and generates a vector. Afterlocally perturbing its location data through the $L T$ method, the user sends it to the shuffler. Theshufflers shuffle the data and sends it to the server. Finally, reconstructing the $_{d p} K D$ tree T after the server collects the reports of allusers.
(3): Response range query: T is traversed and a range query is performed based on the $S D R Q$ algorithm and the final range query result is obtained.

Algorithm 1:

K D R Q

Require:: The dataset $D = d_{1}, d_{2}, \cdot \cdot \cdot, d_{n}$ , range query $R Q$ , privacy budget $ϵ$
Ensure:: The query results $\tilde{R Q}$
1:: //Step 1: Partition
2:: Map the data point of the original dataset to the data space
3:: $P \leftarrow_{d p} K D (D)$
4:: Construct the $_{d p} K D (D)$ tree T
5:: //Step 2: Shuffle and reconstruct the tree
6:: Initialize the dataset
7:: for each user $g_{i}$ in T do
8:: $(l_{i}, y_{i}) \leftarrow S R R_{Q} (g_{i}, ϵ)$
9:: end for
10:: if $l_{i} = l$ then
11:: fornode $v_{i}$ in Tdo
12:: $N^{*} (v_{i}) \leftarrow y_{i} + N (v_{i})$
13:: end for
14:: else
15:: if $N (v_{i}) > d s$ then
16:: $T \leftarrow Q P (v_{i})$
17:: end if
18:: end if
19:: return T
20:: //Step 3: range query
21:: $T \leftarrow n o n$ - $n e g a t i v i t y (T)$
22:: Traversing T from top to bottom
23:: $\tilde{R Q} \leftarrow S R Q (R Q, v)$
24:: return $\tilde{R Q}$

5.2. User DataDivision

The data is divided by constructing a $_{d p} K D$ tree. First, theimportant geographic location information $g_{i}$ in the user spatial data of the dataset is selected for division. Compared with the quadtree, thedimension of the spatial data can be effectively reduced. Then, each dimension in $g_{i}$ is divided according to the selected dividing standard $d s$ , inwhich sparse and dense spaces are considered and the density is calculated. Thedividing standard is selected according to the structure of the $_{d p} K D$ tree. Finally, it is recursively calculated until there is only one leaf node, andthe partition is terminated. Theconstruction of the $_{d p} K D$ tree is as follows (Figure 6): there are six points: (1.2,4), (4.5,2.2), (5,8), (6,10), (8.6,9) and (12,11). Thedimension with the largest variance is selected to divide the space, that is, y = 8, andthe space is divided recursively with the dividing standard of x = 1.2 and x = 8.6 until the number of points in the subspace is 1. Inparticular, to determine that the spatial data contain sparse and dense spaces, thedensity threshold is calculated, andthe space that is less than the density threshold is divided into a sparse space, andthen the space continues to be divided. Thespecific steps are shown in Algorithm 2.

Algorithm 2:

_{d p} K D

Require:: The dataset D, user’s location $g_{i}$
Ensure:: Preliminary data division P
1:: if $g_{i} \neq ⌀$ then
2:: Built cells as the root nodes
3:: Divide the space $g_{i}$ as the selected dividing standard $d s$
4:: for $P_{i} \in P_{1}, P_{2}, \cdot \cdot \cdot, P_{d}$ do
5:: Count the amount of data space $C_{m a x}$
6:: if $C_{m a x} \neq 1$ then
7:: $F = ⌀$ ▹F is the sparse space
8:: Count the density $d e n_{A x i s}$ of the data space ▹ $d e n_{A x i s} = C_{A x i s}$
9:: if $ρ \cdot A v e_{A x i s - X_{i}} > d e n_{A x i s - X_{i}}$ then
10:: $F = F + {A x i s - X_{i}}$ ▹ Take the $A x i s - X$ as an example
11:: While $r d (I n t_{i}, I n t_{i - 1}) < 1$ ▹ $r d (I n t_{i}, I n t_{i - 1}) = \frac{∥ d e n_{i} - d e n_{i - 1} ∥}{d e n_{i} + d e n_{i - 1}}$
12:: Merge intervals $I n t_{i}, I n t_{i - 1}$
13:: Recursive partitioning P
14:: else
15:: return
16:: end for
17:: else
18:: return None
19:: return P

5.3. Shuffle and Reconstruct theTree

After building an empty $_{d p} K D$ tree in the cloud server and sharing it to the user, each user traverses the $_{d p} K D$ tree to find the leaf node of the spatial location and the path of its root node, andassigns 1 to its path. Then, alayer is randomly selected, andthe report is generated by the local transformation. Finally, it is sent to shufflers for shuffling. Thelocal transformation comes from our previous work [17], theshuffled protocol based on multiple shufflers comes from [18] andthe shuffled algorithm is shown in Algorithm 3.

Algorithm 3:

S R R_{Q}

Require:: The user important location $g_{i}$ , privacy budget $ϵ$
Ensure:: The shuffled value $(l_{i}^{'}, y_{i})$
1:: The server shares the T with each user
2:: Each user traverses T to determine the path p
3:: if node $v_{i}$ not in T then
4:: $w e i g h t (v_{i}) = 0$
5:: else
6:: $w e i g h t (v_{i}) = 1$
7:: end if
8:: Randomly select a layer in T to generate a vector $B_{i}$
9:: for $w e i g h t (v_{j})$ in vector $B_{i}$ do
10:: $(l_{i}^{'}, r_{i}) \leftarrow L T (w e i g h t (v_{j}, ϵ))$ ▹ $l_{i}^{'}$ is based on a randomly selected tree layer of T
11:: Each user sends $(l_{i}^{'}, r_{i})$ to the shufflers $S h_{i}$
12:: end for
13:: The shufflers $S h_{i}$ shuffle all the data and send to the server
14:: return $(l_{i}^{'}, y_{i})$

Next, theuser reconstructs the $_{d p} K D$ tree as in lines 10–17 of Algorithm 1.Then, the quadtree partition standard is used to continue to divide the space. Thequadtree partition space algorithm is shown in Algorithm 4. First, the data are divided and the number of data in the data space is calculated. Then, if $P_{n} \leq C_{m a x}$ , the cell is constructed as the root node. Otherwise, according to the quadtree partitioning strategy, four sub-rectangles are constructed and continue to be recursivelypartitioned.

Algorithm 4:

Q P

Require:: The dataset D, user’s location $g_{i}$
Ensure:: Preliminary data division P
1:: Count the amount of data space $C_{m a x}$
2:: if $P_{n} \leq C_{m a x}$ then
3:: Built cells as the root nodes
4:: else
5:: Construct four sub-rectangles $P_{1}, P_{2}, P_{3}, P_{4}$ ▹ According to the quadtree partition strategy
6:: for $P_{i} \in P_{1}, P_{2}, P_{3}, P_{4}$ do
7:: Recursive partitioning P
8:: end for
9:: end if
10:: returnP ▹ The root nodes represent the space P

5.4. RangeQuery

The range query process of the $_{d p} K D$ index can be given in the query rectangle $R Q = [a, b] \times [c, d]$ , where $a, b$ and $c, d$ are the minimum and maximum boundary values on the same dimension, respectively. Afterpost-processing, the range query process is performed. First, the reconstructed $_{d p} K D$ tree is traversed; the corresponding processing according to the relationship between the node and the range query $R Q$ is performed; andfinally, the query result $\tilde{R Q}$ that conforms to the query rectangle is obtained. Thespecific algorithm is asfollows (Algorithm 5).

Algorithm 5:

S D R Q

Require:: The $_{d p} K D$ tree, range query $R Q$
Ensure:: The query result $\tilde{R Q}$
1:: Traverse T from top to bottom ▹ $u n t r a v e l e d (v_{i}) = 0$
2:: if node $v_{i} \subset R Q$ then
3:: $\tilde{R Q} \leftarrow R Q \cup v_{i}$
4:: if node $v_{i}$ is disjoint from $R Q$ then
5:: return
6:: else
7:: ifnode $v_{i}$ is leaf nodethen
8:: Add the intersectant records to candidates
9:: $\tilde{R Q} \leftarrow R Q \cup C a n (v_{i}, R Q)$
10:: else
11:: ifnode $v_{i}$ is not leaf nodethen
12:: Record the child node of $v_{i}$ ▹ $u n t r a v e l e d (v_{i}) = 1$
13:: $\tilde{R Q} \leftarrow S D R Q (v, R Q)$
14:: end if
15:: end if
16:: end if
17:: return $\tilde{R Q}$

5.5. SecurityAnalysis

Our scheme can achieve range queries with a high accuracy with shuffled differential privacy. In the following, we theoretically confirm that our algorithm satisfies shuffled differential privacy.

Theorem1.

The $K D R Q$ algorithm satisfies $(ϵ, δ)$ -shuffled differential privacy.

Proof of Theorem 1.

From the sequence combination of shuffled differential privacy, it can be seen that each shuffled algorithm $S R R_{Q}$ on the given spatial location dataset D satisfies shuffled differential privacy. According to the privacy amplification effect proposed by Balle et al. [5], each data owner uses the $L T$ algorithm to satisfy $(ϵ, δ)$ -local differential privacy when they have the local perturbation, where $ϵ = \sqrt{14 l n (2 / δ) \frac{e^{ϵ} + Ω - 1}{n - 1}}$ . And the domain $ω$ is shuffled through the shuffling algorithm, where $ϵ \leq \sqrt{14 l n (2 / δ) \frac{e^{ϵ} + ω - 1}{n / k - 1}}$ ( $n / / k$ is the number of users participating in the shuffling). □

6. PerformanceEvaluation

6.1. ExperimentSettings

Our experimental platform was a Windows 10 system with an Intel i5-10 505 CPU (3.2 Hz) and a 16 GB memory, and Python was used to implement the algorithm on the Landmark and Checkin datasets. The Landmark dataset includes 870,000 user geolocation data from in the United States and is available from the infochimps platform. The Checkin dataset records 1 million location data of the social networking site Gowalla platform, as shown in Table 2.

Experimental metrics: We used the relative error $R E = \frac{\tilde{R Q} (D) - R Q (D)}{R Q (D)}$ [26] to measure the range query accuracy of $R A P P O R$ [23], private spatial data aggregation ( $P S D A$ ) [21], grid and quadtree of regions ( $G T$ -R) [26], and our algorithm $K D R Q$ . We set the privacy budget parameter values to 0.1, 0.3 and 0.5.

6.2. Evaluation

6.2.1. Landmark-Based $R E$ Comparison

Figure 7, Figure 8 and Figure 9 shows the $R E$ values of the $R A P P O R$ , $P S D A$ , $G T$ -R and $K D R Q$ algorithms in the ranges of [5%,20%], [20%,40%] and [40%,60%] on the Landmark dataset. From Figure 7, it can be seen that the privacy budgets were 0.1, 0.3 and 0.5; the query accuracy of our algorithm $K D R Q$ in the range of [5%,20%] gradually improved; and the query accuracy was higher than that of the other three algorithms. Figure 8 and Figure 9 are similar. In addition, as the query range became larger, the accuracy of the experimental algorithm was improved. This was because the area of the range query became larger and the error was reduced.

6.2.2. Checkin-Based $R E$ Comparison

Figure 10, Figure 11 and Figure 12 show the $R E$ values of the $R A P P O R$ , $P S D A$ , $G T$ -R and $K D R Q$ algorithms on the Checkin dataset in the ranges of [5%,20%], [20%,40%] and [40%,60%]. From Figure 10, we can see that the privacy budgets were 0.1, 0.3 and 0.5; the query accuracy of our algorithm $K D R Q$ in the range of [5%,20%] gradually improved; and the query accuracy was higher than the other three algorithms. Figure 11 and Figure 12 are similar. Because the data distribution of the Checkin dataset is sparse and dense, the query accuracy can be improved by our $K D R Q$ algorithm. In particular, when the privacy budget changed from 0.5 to 0.3, the relative $R E$ values of $G T$ -R and $K D R Q$ were almost same in the range of [40%,60%]. That is, the query accuracies were almost the same, which also shows that we improved the query accuracy while increasing the intensity of privacy protection. In general, taking the Checkin dataset as an example, when the privacy budgets were 0.1, 0.3 and 0.5, the ratios of $R E$ values ranged from 1.63 to 4.52 for all $R E$ ranges.

7. Conclusions

In our paper, aiming at the problem of publishing user’s spatial location with differential privacy protection, combined with the shortcomings of existing range queries, a spatial data publishing method based on $_{d p} K D$ and the quadtree index is proposed that improves the query accuracy through adaptive partitioning. From the perspective of the definition of shuffle differential privacy, $K D R Q$ satisfies $(ϵ, δ)$ -shuffled differential privacy. Finally, the range query accuracy of the $K D R Q$ method was verified by two real large-scale datasets. The experimental results show that $K D R Q$ was obviously superior to the existing similar methods. Furthermore, our work on range queries was aimed at spatial data and needs to be extended to multi-type data. We will carry out the work of graph data publishing with differential privacy in the future.

Author Contributions

Conceptualization, K.L., H.Z. and Y.X.; methodology, K.L.; software, K.L.; validation, H.Z., K.L. and Z.L.; formal analysis, K.L. and H.Z.; investigation, H.Z.; resources, K.L.; writing—original draft preparation, K.L.; writing—review and editing, H.Z.; visualization, Y.X. and K.L.; supervision, Z.L. and K.L.; project administration, H.Z.; funding acquisition, H.Z. All authors read and agreed to the published version of this manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant nos. 62072051 and 62032004).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

n	The number of users
D	Spatial dataset
$g_{i}$	The spatial data of the user
$L T$	Local transform algorithm
l	The layer of tree
$N (v_{i})$	The true value of user locations
$N^{*} (v_{i})$	The estimation value of user locations
$d s$	Dividing standard
$ρ$	Adjustable density parameter
$r d$	The relative density difference of the sparse interval
$Ω$	The domain of local algorithm
$ω$	The domain of shuffled algorithm

References

Available online: https://www.chinanews.com.cn/cj/2023/02-16/9955018.shtmlww (accessed on 1 February 2023).
Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, 4–7 March 2006; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 26–29 October 2013; IEEE: New York City, NY, USA, 2013; pp. 429–438. [Google Scholar]
Balle, B.; Bell, J.; Gascón, A.; Nissim, K. The privacy blanket of the shuffle model. In Proceedings of the Advances in Cryptology–CRYPTO 2019: 39th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 2019; Proceedings, Part II 39. Springer: Berlin/Heidelberg, Germany, 2019; pp. 638–667. [Google Scholar]
Cai, K.; Xiao, X.; Cormode, G. Privlava: Synthesizing relational data with foreign keys under differential privacy. Proc. ACM Manag. Data 2023, 1, 1–25. [Google Scholar] [CrossRef]
Zhang, J.; Cormode, G.; Procopiuc, C.M.; Srivastava, D.; Xiao, X. Privbayes: Private data release via bayesian networks. ACM Trans. Database Syst. (TODS) 2017, 42, 1–41. [Google Scholar] [CrossRef]
Bayer, R. The universal B-tree for multidimensional indexing: General concepts. In Proceedings of the Worldwide Computing and Its Applications: International Conference, WWCA’97, Tsukuba, Japan, 10–11 March 1997; Springer: Berlin/Heidelberg, Germany, 1997; pp. 198–209. [Google Scholar]
Li, D.; Yang, Q.; An, D.; Zhang, Y. A Location Privacy Aware Taxi-Hailing System: Adaptive Differential Privacy-based Dynamic Incentive Method. IEEE Internet Things J. 2023, 11, 914–930. [Google Scholar] [CrossRef]
Wang, N.; Wang, Y.; Wang, Z.; Nie, J.; Wei, Z.; Tang, P.; Gu, Y.; Yu, G. PrivNUD: Effective Range Query Processing under Local Differential Privacy. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; IEEE: New York City, NY, USA, 2023; pp. 2660–2672. [Google Scholar]
Cormode, G.; Procopiuc, C.; Srivastava, D.; Shen, E.; Yu, T. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; IEEE: New York City, NY, USA, 2012; pp. 20–31. [Google Scholar]
Zhang, J.; Xiao, X.; Xie, X. Privtree: A differentially private algorithm for hierarchical decompositions. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 155–170. [Google Scholar]
Kim, J.W.; Kim, D.H.; Jang, B. Application of local differential privacy to collection of indoor positioning data. IEEE Access 2018, 6, 4276–4286. [Google Scholar] [CrossRef]
Kulkarni, T. Answering range queries under local differential privacy. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1832–1834. [Google Scholar]
Du, L.; Zhang, Z.; Bai, S.; Liu, C.; Ji, S.; Cheng, P.; Chen, J. AHEAD: Adaptive hierarchical decomposition for range query under local differential privacy. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 1266–1288. [Google Scholar]
Ahuja, R.; Zeighami, S.; Ghinita, G.; Shahabi, C. A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy. Proc. ACM Manag. Data 2023, 1, 1–25. [Google Scholar] [CrossRef]
Bittau, A.; Erlingsson, Ú.; Maniatis, P.; Mironov, I.; Raghunathan, A.; Lie, D.; Rudominer, M.; Kode, U.; Tinnes, J.; Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 441–459. [Google Scholar]
Wang, T.; Ding, B.; Xu, M.; Huang, Z.; Hong, C.; Zhou, J.; Li, N.; Jha, S. Improving utility and security of the shuffler-based differential privacy. arXiv 2019, arXiv:1908.11515. [Google Scholar] [CrossRef]
Cheu, A.; Zhilyaev, M. Differentially private histograms in the shuffle model from fake users. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; IEEE: New York City, NY, USA, 2022; pp. 440–457. [Google Scholar]
Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Optimal geo-indistinguishable mechanisms for location privacy. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 251–262. [Google Scholar]
Chen, R.; Li, H.; Qin, A.K.; Kasiviswanathan, S.P.; Jin, H. Private spatial data aggregation in the local setting. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; IEEE: New York City, NY, USA, 2016; pp. 289–300. [Google Scholar]
Wang, N.; Xiao, X.; Yang, Y.; Zhao, J.; Hui, S.C.; Shin, H.; Shin, J.; Yu, G. Collecting and analyzing multidimensional data with local differential privacy. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, Macao, 8–11 April 2019; IEEE: New York City, NY, USA, 2019; pp. 638–649. [Google Scholar]
Erlingsson, Ú.; Pihur, V.; Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067. [Google Scholar]
Zhang, H.; Li, K.; Huang, T.; Zhang, X.; Li, W.; Jin, Z.; Gao, F.; Gao, M. Publishing locally private high-dimensional synthetic data efficiently. Inf. Sci. 2023, 633, 343–356. [Google Scholar] [CrossRef]
Li, X.; Liu, W.; Feng, H.; Huang, K.; Hu, Y.; Liu, J.; Ren, K.; Qin, Z. Privacy enhancement via dummy points in the shuffle model. IEEE Trans. Dependable Secur. Comput. 2023, 1001–1016. [Google Scholar] [CrossRef]
Zhang, X.; Fu, N.; Meng, X. Spatial Range Query Method based on Local differential privacy. J. Comput. Res. Dev. 2020, 57, 847–858. [Google Scholar]

Figure 1. Range queries of 1-dimensional data.

Figure 2. Range queries of 2-dimensional data.

Figure 3. Range queries of 3-dimensional data.

Figure 4. Data analysis with shuffled differential privacy.

Figure 5. The framework of system.

Figure 6. The example of $K D$ tree index.

Figure 7. The RE in the range [5%,20%].

Figure 8. The RE in the range [20%,40%].

Figure 9. The RE in the range [40%,60%].

Figure 10. The RE in the range [5%, 20%].

Figure 11. The RE in the range [20%, 40%].

Figure 12. The RE in the range [40%, 60%].

Table 1. Comparison with related works.

Scheme	SDP Protection	High Dimension	Spatial Data	Data Density
AHEAD [13]	×	√	√	√
PSDA [18]	×	×	√	×
HaarHRR [12]	×	√	×	×
SM [19]	×	×	√	×
Our Work	√	√	√	√

Table 2. Datasets.

Datasets	Dataset Size	Position Domain	Sample Size
Landmark	$8.7 \times 10^{5}$	$[- 124.4, - 67] \times [24.6, 49]$	$6 \times 10^{5}$
Checkin	$10^{6}$	$[- 176.3, 177.5] \times [- 48.2, 90]$	$6 \times 10^{5}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

A Range Query Scheme for Spatial Data with Shuffled Differential Privacy (2024)

1. Introduction

2. RelatedWork

3. Preliminaries

3.1. RangeQuery

3.2. Local Differential Privacy

3.3. Shuffled DifferentialPrivacy

4. SystemModel

4.1. SystemModel

4.2. ProblemDefinition

4.3. The Design Goal of the Spatial QueryScheme

5. Range Query over SpatialData

5.1. Overview

5.2. User DataDivision

5.3. Shuffle and Reconstruct theTree

5.4. RangeQuery

5.5. SecurityAnalysis

6. PerformanceEvaluation

6.1. ExperimentSettings

6.2. Evaluation

6.2.1. Landmark-Based R E Comparison

6.2.2. Checkin-Based R E Comparison

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

References

6.2.1. Landmark-Based $R E$ Comparison

6.2.2. Checkin-Based $R E$ Comparison