Building your own search engine with Apache Nutch

sudo apt install default-jdk

java -version
openjdk version "11.0.22" 2024-01-16

vi .bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
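
If the JDK is installed somewhere else, the JAVA_HOME value can be derived from the real location of the java binary. This is a minimal sketch, assuming the usual layout where the binary sits at <home>/bin/java and GNU readlink is available; the helper name java_home_from_bin is made up for illustration.

```shell
# Hypothetical helper: strip the trailing /bin/java from a resolved java path.
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# Resolve symlinks such as /usr/bin/java down to the real JDK directory,
# but only when a java binary is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    java_home_from_bin "$(readlink -f "$(command -v java)")"
fi
```

The printed directory is what would go on the export JAVA_HOME line in .bashrc.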

Download the source code from the Apache Nutch™ – Downloads page (a proxy may be needed to reach it)

mkdir -p urls
cd urls
touch seed.txt 
Put the address of the website to crawl (my own site) in it
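
The seed file can also be written in one step instead of being edited by hand. A small sketch, run from the Nutch directory; https://example.com/ is a placeholder, not the address used in these notes.

```shell
# https://example.com/ is a placeholder; replace it with your own site's address.
mkdir -p urls
# One URL per line; printf avoids any surprises with echo and backslashes.
printf '%s\n' 'https://example.com/' > urls/seed.txt
cat urls/seed.txt
```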

bin/nutch inject crawl/crawldb urls
Output:
Injecting seed URL file file:/data/apache-nutch-1.19/urls/seed.txt
Total new urls injected: 1

bin/nutch generate crawl/crawldb crawl/segments

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

If Nutch reports: No agents listed in 'http.agent.name' property.
add the following inside the <configuration> element of conf/nutch-site.xml:
    <property>
      <name>http.agent.name</name>
      <value>MyNutchBot/1.0</value>
    </property>
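
These notes jump from generate straight to index, but the index command further down reads crawl/linkdb and a fetched segment. The NutchTutorial linked below runs fetch, parse, updatedb and invertlinks in between. The following is a sketch of those steps, not a record of what was actually run here; the NUTCH variable and the process_segment helper are names made up for illustration.

```shell
# NUTCH is an assumed variable naming the launcher; it defaults to bin/nutch.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: process one generated segment, e.g. process_segment "$s1"
process_segment() {
    segment="$1"
    "$NUTCH" fetch "$segment" || return 1                   # download the pages on the fetch list
    "$NUTCH" parse "$segment" || return 1                   # extract text and outlinks
    "$NUTCH" updatedb crawl/crawldb "$segment" || return 1  # merge the results back into the crawldb
    "$NUTCH" invertlinks crawl/linkdb -dir crawl/segments   # build the linkdb read by the index step
}
```

Run from the Nutch directory, process_segment "$s1" would handle the segment chosen after generate. Each step stops the round if it fails, so a broken fetch does not get written into the crawldb.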


export APACHE_SOLR_HOME=/data/solr-8.11.3
export NUTCH_RUNTIME_HOME=/data/apache-nutch-1.19
${APACHE_SOLR_HOME}/bin/solr start -force
If Solr warns that the open file limit is currently 1024, raise it:
vi /etc/security/limits.conf
* soft nofile 4096
* hard nofile 4096
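
Changes to limits.conf only apply to new login sessions, so it is worth checking the limit in the shell that will start Solr. A minimal sketch; the helper name nofile_at_least is made up for illustration, and 4096 is the value set above.

```shell
# Hypothetical helper: succeed if the current open-file limit is at least $1.
nofile_at_least() {
    current=$(ulimit -n)
    [ "$current" = "unlimited" ] || [ "$current" -ge "$1" ]
}

if nofile_at_least 4096; then
    echo "open file limit OK: $(ulimit -n)"
else
    echo "open file limit too low: $(ulimit -n)"
fi
```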
Started Solr server on port 8983 (pid=29369). Happy searching!
Solr is then reachable at http://192.168.1.131:8983

${APACHE_SOLR_HOME}/bin/solr start -force
 
${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/ -force

ls crawl/segments/

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20240326063028/ -filter -normalize -deleteGone
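
One way to check that documents reached the nutch core is to query Solr's select handler. The sketch below only builds the URL. The helper name solr_select_url is made up for illustration, and the host and core are the ones used in these notes.

```shell
# Hypothetical helper: build the select URL for a Solr core.
# $1 = host:port, $2 = core name, $3 = query string
solr_select_url() {
    printf 'http://%s/solr/%s/select?q=%s\n' "$1" "$2" "$3"
}

# *:* matches every document in the core.
url=$(solr_select_url 192.168.1.131:8983 nutch '*:*')
echo "$url"
# With Solr running, fetch the results with: curl -s "$url"
```

A non-zero numFound in the response would indicate that the index step wrote documents to the core.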
 
https://dlcdn.apache.org/lucene/solr/8.11.3/solr-8.11.3.tgz

https://nutch.apache.org/download/
https://dlcdn.apache.org/nutch/1.19/apache-nutch-1.19-bin.tar.gz

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Disabling the robots.txt handling:
https://blog.csdn.net/jediael_lu/article/details/43227693
