
Spark: Setting Up a PySpark Development Environment in Jupyter Notebook

2024-04-01 00:16:28

A quick note for the record.

Prerequisites

  • JDK8
  • Python3.7

Setting Up Spark on Windows

Install JDK 8 and Python 3 first; the details are not covered here.

Install Hadoop 2.7

  • Download: http://archive.apache.org/dist/hadoop/core/hadoop-2.7.7/hadoop-2.7.7.tar.gz

  • Extract the archive

  • Download winutils for Hadoop: https://github.com/steveloughran/winutils

  • Copy the downloaded winutils files into the bin directory of the extracted Hadoop directory

  • Set JAVA_HOME for Hadoop

    Edit the etc/hadoop/hadoop-env.cmd file under the Hadoop directory and point JAVA_HOME at the actual Java installation directory:

    set JAVA_HOME=%JAVA_HOME%
    

    Change it to

    set JAVA_HOME=E:\study\jdk1.8.0_144
    
  • Set the Hadoop environment variables

    The procedure is the same as configuring the JDK environment variables:

    create a HADOOP_HOME variable whose value is the extracted Hadoop root directory,

    then add %HADOOP_HOME%\bin to Path.

  • Verify the Hadoop installation from cmd

    cmd --> run hadoop version

 C:\Users\明柯>hadoop version
 Hadoop 2.7.7
 Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
 Compiled by stevel on 2018-07-18T22:47Z
 Compiled with protoc 2.5.0
 From source with checksum 792e15d20b12c74bd6f19a1fb886490
 This command was run using /F:/ITInstall/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar

If Error: JAVA_HOME is incorrectly set appears, it is usually because the JDK was installed on the C drive under a path containing spaces (e.g. Program Files); move it to another drive or a space-free path.

Install Spark 2.4.x

Version 2.4.8 is installed here.

  • Download: https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

  • Extract it to a directory of your choice

  • Set the SPARK_HOME environment variable and add %SPARK_HOME%\bin to Path

  • Verify from cmd

    cmd --> run pyspark

    C:\Users\明柯>pyspark
    Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    22/02/11 17:21:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.8
          /_/
    
    Using Python version 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018 04:59:51)
    SparkSession available as 'spark'.
    >>>
    
  • Run quit() to exit

  • Run a test Spark job

    cmd --> spark-submit %SPARK_HOME%/examples/src/main/python/pi.py

The result can be seen in the log output:

Pi is roughly 3.142780
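
For reference, the bundled pi.py example estimates π with a Monte Carlo simulation. Below is a minimal sketch of the same idea, not the exact code shipped with Spark; the app name and sample count are just illustrative:

    import random
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.master("local[*]").appName("piEstimate").getOrCreate()
    sc = spark.sparkContext
    
    n = 100000  # number of random points to sample
    
    def inside(_):
        # sample a point in the unit square and test whether it falls inside the quarter circle
        x, y = random.random(), random.random()
        return 1 if x * x + y * y <= 1 else 0
    
    count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly", 4.0 * count / n)
    
    spark.stop()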

Setting Up Spark on Linux

JDK and Python 3.7 must be installed in advance here as well.

Only a single-node setup is demonstrated here; a multi-node setup works the same way. For Hadoop cluster installation, see: https://blog.csdn.net/maoyuanming0806/article/details/81177338

  • Download; the URL is the same as above

    You can use wget, or download it on Windows and copy it to Linux with a file transfer tool.

  • Extract

    tar -zvxf spark-2.4.8-bin-hadoop2.7.tgz -C /opt/module
    cd /opt/module
    mv spark-2.4.8-bin-hadoop2.7 spark-2.4.8
    
  • Test

    cd /opt/module/spark-2.4.8
    bin/spark-submit examples/src/main/python/pi.py
    
    The result can be seen in the printed log:
    Pi is roughly 3.137780
    
  • Set environment variables

    vi /etc/profile
    # add the following
    #==================spark====================
    export SPARK_HOME=/opt/module/spark-2.4.8
    export PATH=$PATH:$SPARK_HOME/bin
    
    # save with :wq, then
    source /etc/profile
    
  • Change the log level

    In the conf directory, copy log4j.properties.template to log4j.properties,

    then edit the log4j.rootCategory=INFO, console line in it to the desired level; the log level can also be changed from code, as sketched below.
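
Besides editing log4j.properties, the level can be adjusted per application from PySpark itself, as the pyspark shell banner suggests. A minimal sketch, with an illustrative app name:

    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    
    # build a local session and silence INFO logs for this application only
    spark = SparkSession.builder.master("local[*]").appName("logLevelDemo").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")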

Installing Jupyter Notebook

Integrating PySpark with Jupyter Notebook on Linux

  • Install Jupyter Notebook

    pip3 install jupyter
    
  • Install findspark

    Jupyter needs the findspark package to locate the Spark installation (see the note after this list for pointing it at SPARK_HOME explicitly).

    pip3 install findspark
    
  • Start Jupyter

    If you do not know where the jupyter command is installed, locate it first:

    find / -name jupyter
    

    Or

    cd /usr/local/python3/bin
    # the jupyter command is in this directory; if it is not on PATH, start it like this
    ./jupyter notebook --allow-root
    
  • Open the Jupyter Notebook web page and run a test

    Create a new notebook:

    import findspark
    findspark.init()  # locate the Spark installation via SPARK_HOME
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
    
    sc = spark.sparkContext
    rdd = sc.parallelize(["hello world", "hello spark"])
    rdd2 = rdd.flatMap(lambda line: line.split(" "))  # split each line into words
    rdd3 = rdd2.map(lambda word: (word, 1))           # pair each word with a count of 1
    rdd5 = rdd3.reduceByKey(lambda a, b: a + b)       # sum the counts per word
    print(rdd5.collect())
    
    sc.stop()
    

    Output:

    [('hello', 2), ('spark', 1), ('world', 1)]
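
findspark.init() normally reads SPARK_HOME from the environment. If that variable is not visible to the Jupyter process, the Spark root can be passed explicitly; a minimal sketch using the install path from this guide:

    import findspark
    # pass the Spark root explicitly when SPARK_HOME is not set for the notebook process
    findspark.init("/opt/module/spark-2.4.8")
    
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
    print(spark.version)
    spark.stop()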
    

Integrating PySpark with Jupyter Notebook on Windows

First install Anaconda: download the installer and install it yourself, there are no special steps. Anaconda installs Jupyter Notebook for you, and as an integrated environment it also makes installing other tools and Python packages easier, so it is recommended.

  • Install Anaconda

  • Go to the Anaconda directory

  • Go into the Scripts directory

  • Open a cmd prompt in the Scripts directory. Make sure you are in this directory; otherwise the installed packages will not be found by Jupyter Notebook. This is a common pitfall on Windows.

  • Install findspark

    pip3 install findspark
    

    If the download is slow, consider switching pip to the Alibaba mirror, which is faster.

  • Test: start Jupyter Notebook, then open the web page in a browser

  • Create a new Python 3 notebook (a DataFrame-based variant is sketched after this list):

    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").appName("wordCount").getOrCreate()
    
    sc = spark.sparkContext
    rdd = sc.parallelize(["hello world", "hello spark"])
    rdd2 = rdd.flatMap(lambda line:line.split(" "))
    rdd3 = rdd2.map(lambda word:(word, 1))
    rdd5 = rdd3.reduceByKey(lambda a, b : a + b)
    print(rdd5.collect())
    
    sc.stop()
    

    Output:

    [('hello', 2), ('spark', 1), ('world', 1)]
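
The same word count can also be written with the DataFrame API instead of raw RDD operations. A minimal sketch; the app name is illustrative:

    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, col
    
    spark = SparkSession.builder.master("local[*]").appName("wordCountDF").getOrCreate()
    
    # one row per input line, then split each line into words and count them
    lines = spark.createDataFrame([("hello world",), ("hello spark",)], ["line"])
    words = lines.select(explode(split(col("line"), " ")).alias("word"))
    words.groupBy("word").count().show()
    
    spark.stop()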
    

