
wiki-talk-parser

This little program can:

  • Parse Wikipedia dump files (XML) into wiki-talk networks. Original Wikipedia UIDs are preserved.
  • "Shrink" the resulting network into an unweighted directed network without self-loops, like the SNAP wiki-Talk dataset.
  • Group users according to their roles.
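The "shrink" step above can be sketched as follows. This is a minimal illustration in Python, not the logic of shrinker.jar itself; the input format (source UID, target UID, timestamp) is an assumption for illustration.

```python
# Sketch of the "shrink" step: collapse a timestamped, possibly repeated
# edge list into an unweighted directed network without self-loops,
# as in the SNAP wiki-Talk dataset.

def shrink(edges):
    """Return unique (src, dst) pairs with self-loops removed."""
    return sorted({(src, dst) for src, dst, *_ in edges if src != dst})

# Hypothetical parsed talk-page edges: (writer UID, talk-page owner UID, timestamp)
edges = [
    (1, 2, "2007-01-01T00:00:00Z"),
    (1, 2, "2007-02-01T00:00:00Z"),  # repeated contact -> kept as one edge
    (2, 2, "2007-03-01T00:00:00Z"),  # self-loop -> dropped
    (3, 1, "2007-04-01T00:00:00Z"),
]
print(shrink(edges))  # [(1, 2), (3, 1)]
```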

Usage with stu

Using stu makes life easier. The only file you need is main.stu. Simply type stu, or:

$ nohup stu -k -j 3 &

Stu will automatically download this program and the datasets, then start parsing. The -j parameter sets the number of jobs that run in parallel; for downloading, more than 3 is not recommended.

Usage without stu

Installation

Manually download the latest jar files.

Parse

$ java -jar parser.jar *input-file* *lang* > *output-file*

Shrink

$ java -jar shrinker.jar *input-file* > *output-file*

Group users

$ java -jar grouper.jar *input-file* > *output-file*
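To illustrate what "group users according to their roles" could mean in terms of the network, here is a hedged Python sketch: it partitions the UIDs appearing in a shrunk edge list by role. The role categories and the uid_role mapping are illustrative assumptions, not the actual logic of grouper.jar.

```python
# Sketch: partition the UIDs of a shrunk wiki-talk network by user role.
# Roles ("admin", "bot", "regular") and the uid -> role mapping are
# hypothetical; real role data would come from the Wikipedia dump.
from collections import defaultdict

def group_users(edges, uid_role):
    """Map each role to the sorted list of UIDs appearing in the network."""
    groups = defaultdict(set)
    for src, dst in edges:
        for uid in (src, dst):
            groups[uid_role.get(uid, "regular")].add(uid)
    return {role: sorted(uids) for role, uids in groups.items()}

edges = [(1, 2), (3, 1), (4, 2)]          # shrunk edge list
uid_role = {1: "admin", 4: "bot"}         # hypothetical role assignments
print(group_users(edges, uid_role))
```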

Compilation

$ lein with-profile parser:shrinker:grouper uberjar

License

Copyright © 2023 Yfiua

Distributed under the Eclipse Public License either version 1.0 or any later version.
