今天突然发现Spark SQL任务启动不起来,报下面的错误,'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on a random free port. You may check whether configuring an appropriate binding address. 2023-05-18 13:57:40,952 WARN util.Utils: Service,看到这段日志后,表明服务器大量端口被占用,Spark申请不到端口,尝试了100次后,抛出了下面的异常。
20/12/21 12:55:18 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on a random free port. You may check whether configuring an appropriate binding address. 20/12/21 12:55:18 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address. java.net.BindException: Address already in use: Service 'org.apache.spark.network.netty.NettyBlockTransferService' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'org.apache.spark.network.netty.NettyBlockTransferService' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address. at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:433) at sun.nio.ch.Net.bind(Net.java:425) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128) at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558) at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283) at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501) at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486) at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989) at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254) at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:745) End of LogType:stderr
排查
执行ss命令,发现大量连接处于CLOSE_WAIT,状态,这非常不正常。ESTABLISHED表示连接已被建立,可以通信了,大量连接处于ESTABLISHED状态才有可能正常。然后执行netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'统计TCP连接状态,发现绝大部份的链接处于CLOSE_WAIT状态,这是非常不可思议情况。
第一步
执行netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'统计TCP连接状态。
第二部
用netstat -tnap命令进行检查。
第三步:查看tcp队列当前情况
1 2 3
ss -lntp State Recv-Q Send-Q Local Address:Port Peer Address:Port LISTEN 101 100