文章

Neo4j

Neo4j

图数据库

图数据库是基于图论实现的一种NoSQL数据库,其数据结构和数据查询方式都是以图论为基础的,图数据库主要用于存储更多的连接数据。

和关系数据库(Relational Database Management System)的对比

关系数据库中,对象之间的关系用外键表示。图数据库中,节点和关系取代表、外键、join。无论何时运行类似JOIN地操作,数据库都会直接访问连接的节点,而无需进行昂贵的搜索和匹配计算。

其他NoSQL还有:键值数据库、列存储数据库、文档型数据库、图数据库。

知识图谱

数据提取:实体抽取、语义标签抽取、关系抽取、事件抽取。数据很多,关系很难,设计大量nlp技术。

自动回答流程:数据集构建、命名实体识别、找对应关系、在图中返回结果

常用技术点:关系抽取、指代消解

图的嵌入

数据库操作

1
2
3
4
5
6
7
8
9
10
11
12
create (n:Person {name:'我', age:31})
create (p:Person {name:'小明', age:25}) -[:借钱 {金额:10000}]-> (n:Person {name:'小李', age:26})
match (p:Person {name:'小明', age:25}) -[f:借钱 {金额:10000}]-> (n:Person {name:'小李', age:26}) delete f
match (p:Person {name:'小明', age:25}) delete p
match (p:Person {name:'小李'}) delete p

match (t:Person) where id(t)=0 set t:Boy
match (t:Person) where id(t)=0 set t.gender="m"

match (p:Person)-[:`借钱`]-> (n:Person) return p,n

match(n) detach delete (n)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
from py2neo import Node, Graph, NodeMatcher, Relationship
g = Graph('http://localhost:7474', auth=('neo4j', 'password'), name='learn')

xiaoming = Node('Person', name='小明', age=10, gender='m')
xiaofang = Node('Person', name='小芳', age=11, gender='f')
xiaozhao = Node('Person', name='小赵', age=12, gender='m')
g.create(xiaoming)
g.create(xiaofang)
g.create(xiaozhao)
g.create(Relationship(xiaoming, '借钱', xiaofang, money=100))
g.create(Relationship(xiaoming, '借钱', xiaofang, money=200))
g.create(Relationship(xiaozhao, '借钱', xiaoming, money=150))
matcher = NodeMatcher(g)
print(matcher.match('Person').where(f"_.gender='m'").all())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
from langchain_community.graphs import Neo4jGraph
NEO4J_URL = 'bolt://localhost:7687'
NEO4J_USERNAME = 'neo4j'
NEO4J_PASSWORD = 'password'
NEO4J_DATABASE = 'neo4j'

kg = Neo4jGraph(url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE)

cypher = """
match (n:Disease)
return count(n) as number_of_disease
"""

result = kg.query(cypher)

更多CQL语句

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
match (n:Disease {name:"苯中毒"})
return n.cause

match (n:Movie)
where n.released >= 1990
and n.released < 2000
return n.title

match (a:Person)-[:ACTED_IN]->(m:Movie)
return a.name, m.title limit 10

match (a:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p:Person)
return p.name, m.title limit 10

match (a:Person {name:"Emil Eifrem"})-[act:ACTED_IN]->(m:Movie)
delete act

create (a:Person {name:"me"})
return a

match (a:Person {name:"me"}), (n:Person {name:"Emil Eifrem"})
merge (a)-[r:WORKS_WITH]->(n)
return a, r, n

给文本计算嵌入

1
2
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL') + '/embeddings'

创建向量索引

1
2
3
4
5
6
7
8
9
10
11
12
13
kg.query("""
  CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
  FOR (m:Movie) ON (m.taglineEmbedding)
  OPTIONS { indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }}"""
)

kg.query("""
  SHOW VECTOR INDEXES
  """
)

给每部电影的tagline计算索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
kg.query("""
    MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
    WITH movie, genai.vector.encode(
        movie.tagline, 
        "OpenAI", 
        {
          token: $openAiApiKey,
          endpoint: $openAiEndpoint
        }) AS vector
    CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
    """, 
    params={"openAiApiKey":OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT} )

result = kg.query("""
    MATCH (m:Movie) 
    WHERE m.tagline IS NOT NULL
    RETURN m.tagline, m.taglineEmbedding
    LIMIT 1
    """
)

进行相似度查询

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
question = "What movies are about love?"
kg.query("""
    WITH genai.vector.encode(
        $question, 
        "OpenAI", 
        {
          token: $openAiApiKey,
          endpoint: $openAiEndpoint
        }) AS question_embedding
    CALL db.index.vector.queryNodes(
        'movie_tagline_embeddings', 
        $top_k, 
        question_embedding
        ) YIELD node AS movie, score
    RETURN movie.title, movie.tagline, score
    """, 
    params={"openAiApiKey":OPENAI_API_KEY,
            "openAiEndpoint": OPENAI_ENDPOINT,
            "question": question,
            "top_k": 5
            })

cpyher语法基础

  • (n) 节点
  • (a)->(b)(a)<-(b)(a)-(b) 关系
  • (a:User) 标签
  • (a {name:'xiaoming', gender:'m'}) 属性

变长关系匹配:

(a)-[*2]->(b)相当于(a)-->()-->(b) (a)-[*3..]->(b)至少三个关系

常见关键字:

  • create 创建节点或关系
  • merge 先查找一模一样的节点,找到就合并,找不到就创建
  • set 设置节点标签和属性
  • delete 删除节点和关系
  • remove 删除属性和标签
  • foreach
  • create unique
  • match 匹配数据
  • optional match
  • where 追加匹配条件
  • start
  • return 返回结果
  • order by 排序
  • limit 限制输出的个数
  • skip 跳过若干个输出
  • with
  • unwind
  • union/union all
  • call
  • case

带查找和匹配的merge:

1
2
3
4
5
6
// 没有则创建节点并设置属性
merge (n:Person{name:"张三"}) on create set n.created=timestamp() return n
// 存在这个节点则创建并赋予属性 不存在则只创建节点
merge (n:Person{name:"张三"}) on match set n.again=True return n
// 两个可以一起使用
merge (n:Person{name:"张三"}) on create set n.created = timestamp() on match set n.lastSeen = timestamp() return n

唯一性约束:

1
create constraint for (n:Person) require n.name is unique

设置标签

1
match (n:Person{name:"张三"}) set n:Chinese:Student return n

删除操作

1
2
3
4
## 删除所有节点以及所有关系
match (n) detach delete n
## 删除所有没有关系的节点
match (n) where count {(n)--()}=0 delete n

匹配不到时返回null

1
match (a:Person{name:"zhangsan"}) optional match (a)-->(x) return x

聚合函数

  • count
  • sum
  • avg
  • max
  • min
  • collect
  • distinct
  • percentileDisc
  • percentileCout
  • stdev
  • stdep

with 的作用,将查询的结果传递给另一部分继续查询

1
match (n {name: "zhangsan" })--(m)-->(s) with m, count(*) as m_count where m_count>1 return m

数据类型:数值、字符串、布尔、空间、时间

运算符

= <> < > <= >= is null is not null

starts with ends with contains

字符串拼接:+ 正则匹配=~

列表运算符 + in [] |

1
2
return [x in [1, 2, 3, 4, 5] where x%2=0 | x^2]
with ['Anne', 'John', 'Bill', 'Diane', 'Eve'] as names return names[1..3] as result

参数,使用参数加美元符

1
2
3
{
  "name": "peerReviews"
}

索引

索引分为两类。

  • 搜索性能索引: 用于加速基于精确匹配的数据检索。此类别包括范围、文本、点和标记查找索引。
  • 语义索引: 用于近似匹配和计算查询字符串与匹配数据之间的相似性分数。此类别包括全文索引和矢量索引。

搜索性能索引:

  • 范围索引(range indexes): Neo4j的默认索引,支持大多数类型的谓词。
  • 文本索引(text indexes): 解决在string值上操作的谓词。针对字符串操作符contains和ends with的查询过滤进行了优化。
  • 点索引(point indexes): 解决空间点值上的谓词,针对距离或边界框内的查询进行了优化。
  • 令牌查找索引(token lookup indexes): 仅解决节点标签和关系类型谓词(即它们不能解决属性上的任何谓词过滤)。在Neo4j中创建数据库时,会出现两个令牌查找索引(一个用于节点标签,另一个用于关系类型)。

语义索引:

  • Full-text indexes(全文索引): 支持在STRING属性的内容中进行搜索,并支持查询字符串与存储在数据库中的STRING值之间的相似性比较。
  • Vector indexes(向量索引): 通过将节点或属性表示为多维空间中的向量,支持相似性搜索和复杂的分析查询。

全文索引一例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
// 测试用数据
CREATE (nilsE:Employee {name: "Nils-Erik Karlsson", position: "Engineer", team: "Kernel", peerReviews: ['Nils-Erik is difficult to work with.', 'Nils-Erik is often late for work.']}),
(lisa:Manager {name: "Lisa Danielsson", position: "Engineering manager"}),
(nils:Employee {name: "Nils Johansson", position: "Engineer", team: "Operations"}),
(maya:Employee {name: "Maya Tanaka", position: "Senior Engineer", team:"Operations"}),
(lisa)-[:REVIEWED {message: "Nils-Erik is reportedly difficult to work with."}]->(nilsE),
(maya)-[:EMAILED {message: "I have booked a team meeting tomorrow."}]->(nils)

// 构建索引
CREATE FULLTEXT INDEX namesAndTeams FOR (n:Employee|Manager) ON EACH [n.name, n.team]
CREATE FULLTEXT INDEX communications FOR ()-[r:REVIEWED|EMAILED]-() ON EACH [r.message]
CREATE FULLTEXT INDEX peerReviews FOR (n:Employee|Manager) ON EACH [n.peerReviews]
OPTIONS {
  indexConfig: {
    `fulltext.analyzer`: 'english', // 英文分析器
    `fulltext.eventually_consistent`: true // 
  }
}

// 节点查找
CALL db.index.fulltext.queryNodes("namesAndTeams", "nils") YIELD node, score
RETURN node.name, score

// 关系查找
CALL db.index.fulltext.queryRelationships("communications", "meeting") YIELD relationship, score
RETURN type(relationship), relationship.message, score

// 支持布尔运算符(这什么原理?)
CALL db.index.fulltext.queryNodes("namesAndTeams", 'nils AND kernel') YIELD node, score
RETURN node.name, node.team, score

// 在特定属性中查找
// 查询特定属性的全文索引
CALL db.index.fulltext.queryNodes("namesAndTeams", 'team:"Operations"') YIELD node, score
RETURN node.name, node.team, score

// 删除索引
DROP INDEX communications

向量值索引的创建

CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR (n:LabelName) ON (n.propertyName) OPTIONS "{" option: value[, ...] "}" CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR ()-”[“r:TYPE_NAME”]”-() ON (r.propertyName)OPTIONS "{" option: value[, ...] "}"

例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// (:Title)<--(:Paper)-->(:Abstract)

CREATE VECTOR INDEX `abstract-embeddings`
FOR (n: Abstract) ON (n.embedding)
OPTIONS {indexConfig: {
 `vector.dimensions`: 1536,  // 嵌入维度
 `vector.similarity_function`: 'cosine' // 相似性函数
}}

// 查找
MATCH (title:Title)<--(:Paper)-->(abstract:Abstract)
WHERE toLower(title.text) = 'efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs'
CALL db.index.vector.queryNodes('abstract-embeddings', 10, abstract.embedding)
YIELD node AS similarAbstract, score
MATCH (similarAbstract)<--(:Paper)-->(similarTitle:Title)
RETURN similarTitle.text AS title, score
本文由作者按照 CC BY 4.0 进行授权