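Recently at work I needed to compute batch statistics over login logs (daily active users, visit counts, and so on). Each log file is about 40 MB, there are several files per day, and the daily volume is a few million records.
My first idea was naive: load everything into a table and run GROUP BY queries. Performance turned out to be too poor to be usable, so I considered doing it in a program instead: read, parse, aggregate, then write the results to a table. Later I heard about tools such as Spring Batch, so here are my notes from learning it.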
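For reference, the hand-rolled "read, parse, aggregate" idea can be sketched in plain Java without any framework. This is only a minimal sketch: the line format (`yyyy-MM-dd HH:mm:ss,userId`) and the names `LoginLogStats` and `DayStats` are assumptions made for illustration, not the real log layout.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Stream;

public class LoginLogStats {

    /** Per-day counters: total visits and the set of distinct users seen. */
    static final class DayStats {
        long visits;
        final Set<String> users = new HashSet<>();
    }

    /**
     * Consumes log lines one at a time and aggregates them in memory.
     * Assumed (hypothetical) line format: "yyyy-MM-dd HH:mm:ss,userId".
     */
    static Map<String, DayStats> aggregate(Stream<String> lines) {
        Map<String, DayStats> byDay = new TreeMap<>();
        lines.forEach(line -> {
            String[] parts = line.split(",", 2);
            if (parts.length < 2 || parts[0].length() < 10) {
                return; // skip malformed lines
            }
            String day = parts[0].substring(0, 10); // the date part of the timestamp
            DayStats stats = byDay.computeIfAbsent(day, d -> new DayStats());
            stats.visits++;
            stats.users.add(parts[1].trim());
        });
        return byDay;
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
                "2017-01-01 08:00:00,alice",
                "2017-01-01 09:30:00,bob",
                "2017-01-01 10:00:00,alice",
                "2017-01-02 08:00:00,alice");
        aggregate(sample).forEach((day, s) ->
                System.out.println(day + " visits=" + s.visits + " dau=" + s.users.size()));
    }
}
```

With the sample lines this prints `2017-01-01 visits=3 dau=2` and `2017-01-02 visits=1 dau=1`. For a real file, passing `Files.lines(path)` streams the file instead of loading it whole; only the per-day totals would then need to be written to the table.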
Spring Batch documentation
Chinese: https://kimmking.gitbooks.io/springbatchreference/
English: http://docs.spring.io/spring-batch/trunk/reference/html/index.html
The official example
The official example is very simple and uses Java config in place of XML configuration files.
- File list:
src
└─main
    ├─java
    │  └─hello
    │          Application.java
    │          BatchConfiguration.java
    │          JobCompletionNotificationListener.java
    │          Person.java
    │          PersonItemProcessor.java
    │
    └─resources
            sample-data.csv
            schema-all.sql
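Brief overview:
Application is the startup class and contains no business logic.
BatchConfiguration holds the Spring Batch configuration. It is annotated with
@Configuration
@EnableBatchProcessing
so the configuration is picked up as soon as Spring starts.
JobCompletionNotificationListener acts as a sanity check: once the batch job has finished, it queries the database to see whether the data is there.
Person is a POJO.
PersonItemProcessor is the processing class; here it upper-cases the names and returns the result.
sample-data.csv is the data source. It holds a few records with comma-separated fields.
schema-all.sql contains the CREATE TABLE statement.
The code uses SQL directly, and the embedded database is HSQLDB. The core code is explained below.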
BatchConfiguration.java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ClassPathResource;

import javax.sql.DataSource;

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public DataSource dataSource;

    // tag::readerwriterprocessor[]
    @Bean
    public FlatFileItemReader<Person> reader() {
        FlatFileItemReader<Person> reader = new FlatFileItemReader<Person>();
        reader.setResource(new ClassPathResource("sample-data.csv"));
        reader.setLineMapper(new DefaultLineMapper<Person>() {{
            setLineTokenizer(new DelimitedLineTokenizer() {{
                setNames(new String[] { "firstName", "lastName" });
            }});
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
                setTargetType(Person.class);
            }});
        }});
        return reader;
    }

    @Bean
    public PersonItemProcessor processor() {
        return new PersonItemProcessor();
    }

    @Bean
    public JdbcBatchItemWriter<Person> writer() {
        JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<Person>();
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Person>());
        writer.setSql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)");
        writer.setDataSource(dataSource);
        return writer;
    }
    // end::readerwriterprocessor[]

    // tag::jobstep[]
    @Bean
    public Job importUserJob(JobCompletionNotificationListener listener) {
        return jobBuilderFactory.get("importUserJob")
                .incrementer(new RunIdIncrementer())
                .listener(listener)
                .flow(step1())
                .end()
                .build();
    }

    @Bean
    public Step step1() {
        return stepBuilderFactory.get("step1")
                .<Person, Person> chunk(10)
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .build();
    }
    // end::jobstep[]
}
The configuration above defines a reader, a writer, a processor, one job, and step1.
- The processor simply delegates to PersonItemProcessor, which here only converts the names to upper case
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.ItemProcessor;

public class PersonItemProcessor implements ItemProcessor<Person, Person> {

    private static final Logger log = LoggerFactory.getLogger(PersonItemProcessor.class);

    @Override
    public Person process(final Person person) throws Exception {
        final String firstName = person.getFirstName().toUpperCase();
        final String lastName = person.getLastName().toUpperCase();
        final Person transformedPerson = new Person(firstName, lastName);
        log.info("Converting (" + person + ") into (" + transformedPerson + ")");
        return transformedPerson;
    }
}
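Person.java is not listed in this post, so here is a minimal sketch of what the code above needs from it. The two-argument constructor and the getters are used by PersonItemProcessor, and a no-arg constructor plus setters are what a bean-style mapper such as BeanWrapperFieldSetMapper relies on to populate the object. The `toString` format is an assumption, chosen to match the log lines shown at the end of this post.

```java
public class Person {

    private String firstName;
    private String lastName;

    // No-arg constructor: bean-style mappers instantiate the target type
    // first and then fill it through the setters.
    public Person() {
    }

    public Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    // Assumed format, picked to resemble the writer's log output.
    @Override
    public String toString() {
        return "firstName: " + firstName + ", lastName: " + lastName;
    }

    public static void main(String[] args) {
        // Mirrors what the processor does: build an upper-cased copy.
        Person original = new Person("Jill", "Doe");
        Person upper = new Person(original.getFirstName().toUpperCase(),
                original.getLastName().toUpperCase());
        System.out.println("Converting (" + original + ") into (" + upper + ")");
    }
}
```

Running `main` prints `Converting (firstName: Jill, lastName: Doe) into (firstName: JILL, lastName: DOE)`.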
Flow
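As a rough picture of the flow: step1 is a chunk-oriented step, so items are read one at a time, passed through the processor, and handed to the writer in groups whose size is the `chunk(10)` value. The plain-Java sketch below imitates only that loop. `ChunkLoopSketch` and `runStep` are hypothetical names, and the sketch deliberately leaves out what the real framework adds around each chunk (transactions, restart metadata, skip and retry handling).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class ChunkLoopSketch {

    /**
     * Simplified chunk loop: read up to chunkSize items, process each one,
     * then pass the whole group to the writer in a single call.
     * Returns the number of chunks written.
     */
    static <I, O> int runStep(Iterator<I> reader,
                              Function<I, O> processor,
                              Consumer<List<O>> writer,
                              int chunkSize) {
        int chunks = 0;
        while (reader.hasNext()) {
            // Read phase: collect at most chunkSize items.
            List<I> inputs = new ArrayList<>(chunkSize);
            while (inputs.size() < chunkSize && reader.hasNext()) {
                inputs.add(reader.next());
            }
            // Process phase: transform items one by one.
            List<O> outputs = new ArrayList<>(inputs.size());
            for (I input : inputs) {
                outputs.add(processor.apply(input));
            }
            // Write phase: one writer call per chunk.
            writer.accept(outputs);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Jill", "Joe", "Justin", "Jane", "John");
        int chunks = ChunkLoopSketch.<String, String>runStep(
                names.iterator(),
                name -> name.toUpperCase(),
                chunk -> System.out.println("write " + chunk),
                2);
        System.out.println("chunks=" + chunks);
    }
}
```

With a chunk size of 2 and five names, this prints `write [JILL, JOE]`, `write [JUSTIN, JANE]`, `write [JOHN]`, and finally `chunks=3`. With the sample's chunk size of 10, all five records would reach the writer in a single call.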
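Notes
With Java config, much of the explicit configuration disappears, and some parts are hard to follow as a result. In the official documentation, most configuration examples are XML-based.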
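A more conventional example
The official example is a bit unorthodox: it uses Spring Boot and a few other features, with no XML in sight. Yet the examples in the Spring Batch reference documentation are almost all XML configuration. So here is an example configured with XML; its source code is linked from this post.
- Code listing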
main
├─java
│  └─com
│      └─yp
│          └─batch
│              │  App.java
│              │  PersonFieldSetMapper.java
│              │  PersonItemProcessor.java
│              │  PersonItemWriter.java
│              │
│              └─entity
│                      Person.java
│
└─resources
        applicationContext.xml
        log4j.xml
        sample-data.csv
- App.java is the startup class
import org.apache.log4j.Logger;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class App {

    static Logger log = Logger.getLogger(App.class);

    public static void main(String[] args) throws Exception {
        String[] springConfig = { "applicationContext.xml" };
        ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(springConfig);
        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("helloWorldJob");
        JobExecution execution = jobLauncher.run(job, new JobParameters());
        log.info("Exit Status : " + execution.getStatus());
        context.close();
    }
}
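It loads the configuration file, looks up the jobLauncher and helloWorldJob beans, and then calls JobLauncher.run.
- applicationContext.xml holds the configuration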
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:jdbc="http://www.springframework.org/schema/jdbc"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
       http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch.xsd
       http://www.springframework.org/schema/jdbc http://www.springframework.org/schema/jdbc/spring-jdbc.xsd">

    <bean id="cvsFileItemReader" class="org.springframework.batch.item.file.FlatFileItemReader">
        <property name="resource" value="classpath:sample-data.csv"/>
        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
                <property name="lineTokenizer">
                    <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                    </bean>
                </property>
                <property name="fieldSetMapper">
                    <bean class="com.yp.batch.PersonFieldSetMapper"/>
                </property>
            </bean>
        </property>
    </bean>

    <bean id="itemProcessor" class="com.yp.batch.PersonItemProcessor"/>
    <bean id="personWriter" class="com.yp.batch.PersonItemWriter"></bean>

    <bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
        <property name="jobRepository" ref="jobRepository"/>
    </bean>
    <bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
        <property name="transactionManager" ref="transactionManager"/>
    </bean>
    <bean id="transactionManager" class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

    <job id="helloWorldJob" xmlns="http://www.springframework.org/schema/batch">
        <step id="step1">
            <tasklet>
                <chunk reader="cvsFileItemReader" writer="personWriter" processor="itemProcessor"
                       commit-interval="10">
                </chunk>
            </tasklet>
        </step>
    </job>
</beans>
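The configuration above is fairly self-explanatory, so I won't go through it in detail; for anything else, please refer to the source code.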
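One detail worth spelling out: the DelimitedLineTokenizer bean in the XML is given no field names, so PersonFieldSetMapper presumably reads the fields by position. Its source is not shown in this post, so the following is only a plain-Java sketch of the tokenise-then-map idea, without any Spring Batch types. `LineMappingSketch`, `tokenize`, and `mapFields` are hypothetical names, and unlike the real tokenizer this sketch does not handle quoted fields.

```java
import java.util.Arrays;

public class LineMappingSketch {

    /** Minimal stand-in for the Person entity (two string fields). */
    static final class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        @Override
        public String toString() {
            return "firstName: " + firstName + ", lastName: " + lastName;
        }
    }

    /** Step 1: split a line on commas; the -1 limit keeps trailing empty fields. */
    static String[] tokenize(String line) {
        return line.split(",", -1);
    }

    /** Step 2: map fields to the object by position (index 0 and index 1). */
    static Person mapFields(String[] fields) {
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields but got " + fields.length);
        }
        return new Person(fields[0].trim(), fields[1].trim());
    }

    public static void main(String[] args) {
        for (String line : Arrays.asList("Jill,Doe", "Joe,Doe")) {
            System.out.println(mapFields(tokenize(line)));
        }
    }
}
```

This prints `firstName: Jill, lastName: Doe` and `firstName: Joe, lastName: Doe`. Mapping by position keeps the XML short, at the cost of silently depending on the column order of the CSV file.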
- The data file, sample-data.csv, is taken from the official example
Jill,Doe
Joe,Doe
Justin,Doe
Jane,Doe
John,Doe
- Output
20:00:55,214 INFO [PersonItemWriter] write : firstName: JILL, lastName: DOE
20:00:55,214 INFO [PersonItemWriter] write : firstName: JOE, lastName: DOE
20:00:55,214 INFO [PersonItemWriter] write : firstName: JUSTIN, lastName: DOE
20:00:55,214 INFO [PersonItemWriter] write : firstName: JANE, lastName: DOE
20:00:55,214 INFO [PersonItemWriter] write : firstName: JOHN, lastName: DOE