Neo4j
图数据库
图数据库是基于图论实现的一种NoSQL数据库,其数据结构和数据查询方式都是以图论为基础的,图数据库主要用于存储更多的连接数据。
和关系数据库(Relational Database Management System)的对比
关系数据库中,对象之间的关系用外键表示。图数据库中,节点和关系取代表、外键、join。无论何时运行类似JOIN地操作,数据库都会直接访问连接的节点,而无需进行昂贵的搜索和匹配计算。
其他NoSQL还有:键值数据库、列存储数据库、文档型数据库、图数据库。
知识图谱
数据提取:实体抽取、语义标签抽取、关系抽取、事件抽取。数据很多,关系很难,设计大量nlp技术。
自动回答流程:数据集构建、命名实体识别、找对应关系、在图中返回结果
常用技术点:关系抽取、指代消解
图的嵌入
数据库操作
1
2
3
4
5
6
7
8
9
10
11
12
create (n:Person {name:'我', age:31})
create (p:Person {name:'小明', age:25}) -[:借钱 {金额:10000}]-> (n:Person {name:'小李', age:26})
match (p:Person {name:'小明', age:25}) -[f:借钱 {金额:10000}]-> (n:Person {name:'小李', age:26}) delete f
match (p:Person {name:'小明', age:25}) delete p
match (p:Person {name:'小李'}) delete p
match (t:Person) where id(t)=0 set t:Boy
match (t:Person) where id(t)=0 set t.gender="m"
match (p:Person)-[:`借钱`]-> (n:Person) return p,n
match(n) detach delete (n)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
from py2neo import Node, Graph, NodeMatcher, Relationship
g = Graph('http://localhost:7474', auth=('neo4j', 'password'), name='learn')
xiaoming = Node('Person', name='小明', age=10, gender='m')
xiaofang = Node('Person', name='小芳', age=11, gender='f')
xiaozhao = Node('Person', name='小赵', age=12, gender='m')
g.create(xiaoming)
g.create(xiaofang)
g.create(xiaozhao)
g.create(Relationship(xiaoming, '借钱', xiaofang, money=100))
g.create(Relationship(xiaoming, '借钱', xiaofang, money=200))
g.create(Relationship(xiaozhao, '借钱', xiaoming, money=150))
matcher = NodeMatcher(g)
print(matcher.match('Person').where(f"_.gender='m'").all())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
from langchain_community.graphs import Neo4jGraph
NEO4J_URL = 'bolt://localhost:7687'
NEO4J_USERNAME = 'neo4j'
NEO4J_PASSWORD = 'password'
NEO4J_DATABASE = 'neo4j'
kg = Neo4jGraph(url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE)
cypher = """
match (n:Disease)
return count(n) as number_of_disease
"""
result = kg.query(cypher)
更多CQL语句
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
match (n:Disease {name:"苯中毒"})
return n.cause
match (n:Movie)
where n.released >= 1990
and n.released < 2000
return n.title
match (a:Person)-[:ACTED_IN]->(m:Movie)
return a.name, m.title limit 10
match (a:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p:Person)
return p.name, m.title limit 10
match (a:Person {name:"Emil Eifrem"})-[act:ACTED_IN]->(m:Movie)
delete act
create (a:Person {name:"me"})
return a
match (a:Person {name:"me"}), (n:Person {name:"Emil Eifrem"})
merge (a)-[r:WORKS_WITH]->(n)
return a, r, n
给文本计算嵌入
1
2
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL') + '/embeddings'
创建向量索引
1
2
3
4
5
6
7
8
9
10
11
12
13
kg.query("""
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}"""
)
kg.query("""
SHOW VECTOR INDEXES
"""
)
给每部电影的tagline计算索引
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
kg.query("""
MATCH (movie:Movie) WHERE movie.tagline IS NOT NULL
WITH movie, genai.vector.encode(
movie.tagline,
"OpenAI",
{
token: $openAiApiKey,
endpoint: $openAiEndpoint
}) AS vector
CALL db.create.setNodeVectorProperty(movie, "taglineEmbedding", vector)
""",
params={"openAiApiKey":OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT} )
result = kg.query("""
MATCH (m:Movie)
WHERE m.tagline IS NOT NULL
RETURN m.tagline, m.taglineEmbedding
LIMIT 1
"""
)
进行相似度查询
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
question = "What movies are about love?"
kg.query("""
WITH genai.vector.encode(
$question,
"OpenAI",
{
token: $openAiApiKey,
endpoint: $openAiEndpoint
}) AS question_embedding
CALL db.index.vector.queryNodes(
'movie_tagline_embeddings',
$top_k,
question_embedding
) YIELD node AS movie, score
RETURN movie.title, movie.tagline, score
""",
params={"openAiApiKey":OPENAI_API_KEY,
"openAiEndpoint": OPENAI_ENDPOINT,
"question": question,
"top_k": 5
})
cpyher语法基础
(n)
节点(a)->(b)
、(a)<-(b)
、(a)-(b)
关系(a:User)
标签(a {name:'xiaoming', gender:'m'})
属性
变长关系匹配:
(a)-[*2]->(b)
相当于(a)-->()-->(b)
(a)-[*3..]->(b)
至少三个关系
常见关键字:
create
创建节点或关系merge
先查找一模一样的节点,找到就合并,找不到就创建set
设置节点标签和属性delete
删除节点和关系remove
删除属性和标签foreach
create unique
match
匹配数据optional match
where
追加匹配条件start
return
返回结果order by
排序limit
限制输出的个数skip
跳过若干个输出with
unwind
union/union all
call
case
带查找和匹配的merge:
1
2
3
4
5
6
// 没有则创建节点并设置属性
merge (n:Person{name:"张三"}) on create set n.created=timestamp() return n
// 存在这个节点则创建并赋予属性 不存在则只创建节点
merge (n:Person{name:"张三"}) on match set n.again=True return n
// 两个可以一起使用
merge (n:Person{name:"张三"}) on create set n.created = timestamp() on match set n.lastSeen = timestamp() return n
唯一性约束:
1
create constraint for (n:Person) require n.name is unique
设置标签
1
match (n:Person{name:"张三"}) set n:Chinese:Student return n
删除操作
1
2
3
4
## 删除所有节点以及所有关系
match (n) detach delete n
## 删除所有没有关系的节点
match (n) where count {(n)--()}=0 delete n
匹配不到时返回null
1
match (a:Person{name:"zhangsan"}) optional match (a)-->(x) return x
聚合函数
count
sum
avg
max
min
collect
distinct
percentileDisc
percentileCout
stdev
stdep
with 的作用,将查询的结果传递给另一部分继续查询
1
match (n {name: "zhangsan" })--(m)-->(s) with m, count(*) as m_count where m_count>1 return m
数据类型:数值、字符串、布尔、空间、时间
运算符
=
<>
<
>
<=
>=
is null
is not null
starts with
ends with
contains
字符串拼接:+
正则匹配=~
列表运算符 +
in
[]
|
1
2
return [x in [1, 2, 3, 4, 5] where x%2=0 | x^2]
with ['Anne', 'John', 'Bill', 'Diane', 'Eve'] as names return names[1..3] as result
参数,使用参数加美元符
1
2
3
{
"name": "peerReviews"
}
索引
索引分为两类。
- 搜索性能索引: 用于加速基于精确匹配的数据检索。此类别包括范围、文本、点和标记查找索引。
- 语义索引: 用于近似匹配和计算查询字符串与匹配数据之间的相似性分数。此类别包括全文索引和矢量索引。
搜索性能索引:
- 范围索引(range indexes): Neo4j的默认索引,支持大多数类型的谓词。
- 文本索引(text indexes): 解决在string值上操作的谓词。针对字符串操作符contains和ends with的查询过滤进行了优化。
- 点索引(point indexes): 解决空间点值上的谓词,针对距离或边界框内的查询进行了优化。
- 令牌查找索引(token lookup indexes): 仅解决节点标签和关系类型谓词(即它们不能解决属性上的任何谓词过滤)。在Neo4j中创建数据库时,会出现两个令牌查找索引(一个用于节点标签,另一个用于关系类型)。
语义索引:
- Full-text indexes(全文索引): 支持在STRING属性的内容中进行搜索,并支持查询字符串与存储在数据库中的STRING值之间的相似性比较。
- Vector indexes(向量索引): 通过将节点或属性表示为多维空间中的向量,支持相似性搜索和复杂的分析查询。
全文索引一例:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
// 测试用数据
CREATE (nilsE:Employee {name: "Nils-Erik Karlsson", position: "Engineer", team: "Kernel", peerReviews: ['Nils-Erik is difficult to work with.', 'Nils-Erik is often late for work.']}),
(lisa:Manager {name: "Lisa Danielsson", position: "Engineering manager"}),
(nils:Employee {name: "Nils Johansson", position: "Engineer", team: "Operations"}),
(maya:Employee {name: "Maya Tanaka", position: "Senior Engineer", team:"Operations"}),
(lisa)-[:REVIEWED {message: "Nils-Erik is reportedly difficult to work with."}]->(nilsE),
(maya)-[:EMAILED {message: "I have booked a team meeting tomorrow."}]->(nils)
// 构建索引
CREATE FULLTEXT INDEX namesAndTeams FOR (n:Employee|Manager) ON EACH [n.name, n.team]
CREATE FULLTEXT INDEX communications FOR ()-[r:REVIEWED|EMAILED]-() ON EACH [r.message]
CREATE FULLTEXT INDEX peerReviews FOR (n:Employee|Manager) ON EACH [n.peerReviews]
OPTIONS {
indexConfig: {
`fulltext.analyzer`: 'english', // 英文分析器
`fulltext.eventually_consistent`: true //
}
}
// 节点查找
CALL db.index.fulltext.queryNodes("namesAndTeams", "nils") YIELD node, score
RETURN node.name, score
// 关系查找
CALL db.index.fulltext.queryRelationships("communications", "meeting") YIELD relationship, score
RETURN type(relationship), relationship.message, score
// 支持布尔运算符(这什么原理?)
CALL db.index.fulltext.queryNodes("namesAndTeams", 'nils AND kernel') YIELD node, score
RETURN node.name, node.team, score
// 在特定属性中查找
// 查询特定属性的全文索引
CALL db.index.fulltext.queryNodes("namesAndTeams", 'team:"Operations"') YIELD node, score
RETURN node.name, node.team, score
// 删除索引
DROP INDEX communications
向量值索引的创建
CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR (n:LabelName) ON (n.propertyName) OPTIONS "{" option: value[, ...] "}"
CREATE VECTOR INDEX [index_name] [IF NOT EXISTS] FOR ()-”[“r:TYPE_NAME”]”-() ON (r.propertyName)OPTIONS "{" option: value[, ...] "}"
例子
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// (:Title)<--(:Paper)-->(:Abstract)
CREATE VECTOR INDEX `abstract-embeddings`
FOR (n: Abstract) ON (n.embedding)
OPTIONS {indexConfig: {
`vector.dimensions`: 1536, // 嵌入维度
`vector.similarity_function`: 'cosine' // 相似性函数
}}
// 查找
MATCH (title:Title)<--(:Paper)-->(abstract:Abstract)
WHERE toLower(title.text) = 'efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs'
CALL db.index.vector.queryNodes('abstract-embeddings', 10, abstract.embedding)
YIELD node AS similarAbstract, score
MATCH (similarAbstract)<--(:Paper)-->(similarTitle:Title)
RETURN similarTitle.text AS title, score