需要一种数据结构来处理带有标签集的元素,并能高效地查找其子集的所有元素

我有一些元素在集合 A 中,这些元素有一组来自集合 B 的标签集。是否有一种数据结构支持如下操作?

  1. 构建,在给定所有可能的标签和所有元素的情况下。(在我进行任何查找之前,我将知道所有的标签和元素)
  2. 给定一组标签,高效地查找其子集的所有元素。

我在 Lua 中进行这项操作,因此我可以使用可变表格,具有常(足够)时间插入和删除以及线性遍历的功能。

朴素的方法是保留所有元素的列表,然后遍历每个元素,并检查其标签是否是输入的子集。这个方法的时间复杂度为 O(nw),其中 n 是元素数量,w 是一个元素具有的最大标签数。w 很可能永远不会超过 10,因此这个时间复杂度可以被认为是线性的。

是否有一种数据结构可以在亚线性的时间内给我这样的查找?

这个问题的背景是一个简单的化学反应系统:化学反应列出反应物,在化学容器改变其成分时,我需要找到所有适用的反应物,也就是其反应物全部存在于容器中的反应物。这个问题推广到任何需要完成一组事情时都是适用的。

点赞
用户1988604
用户1988604

Sub-linear is not possible in the worst case because if the given set of tags contains ALL tags, then ALL elements must be returned. But here's an algorithm that can be sub-linear in the average case.

For each element, use a 128-bit integer (2 x 64-bit integers if you have to) to represent the 100 or so possible tags (i.e. each bit represents a different tag: 1 if the element has the tag, and 0 otherwise.)

Then sort the array based on the 128-bit integer representation.

Given a list of tags, turn it into the 128-bit integer representation so that you can use bit-wise operations to determine whether an element has a subset of the given tags or not. (Use OR operation, followed by a count of the number of 1s in the result.)

The actual speed up (other than the speed up by using bitwise operations) is explained below. For this, let us simplify the input down to 8 bits.

Suppose the list of tags were given as 01001100.

Then you can search for 00000100 in the list of elements in O(logn) time because the list is sorted, and grab a list of elements until 00000101 is reached. Then you can jump to 00001000, again in O(logn) time, and search there until you reach 00001101, then jump to 01000000, and so on.

The end result is that you've jumped over a significant number of elements, at the cost of spending O(tlogn) for the jumps, where t is the number of tags in the input set of tags.

You can take this idea a bit further by jumping more. i.e. jump to 00000100, then jump to 00001000, then to 00001100, then to 01000000, then to 01000100, and so on. This results in spending O(2^tlogn) time for jumping around, but this maybe worth it if t << n and 2^t is manageable.

2016-07-18 02:08:09