Created by mumble 12 minutes ago

OK. Since I was obsessed with k5 for so long, I have quite a bit of k5 content.
With k5 down, I recommend you wget it all and spread it in all directions (maybe even a torrent?).
Seriously, my hosting gets almost no traffic, so it won't mind you downloading all of this.
Also, I've moved most of it from k5-stats.org (which will go away eventually) to k5.semantic-db.org, which I'm going to keep in the long term.

Now, what do I have?
1) a big list of most k5 user names and their user-ids:
http://k5.semantic-db.org/full-k5-user-list.txt
This is a reminder of the k5 glory days, and of how many former kurons there really are.
eg, a quick look at the file shows "104774 walverio" as the last entry. So there were at least 100,000 kurons! (ignoring dupes)
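
If you want to sanity-check that, here is a minimal Python sketch, assuming each line has the form "user-id user-name" (which is what the walverio entry suggests):

# count-kurons.py -- count entries and find the highest user-id
ids = []
with open('full-k5-user-list.txt') as f:
    for line in f:
        parts = line.split(None, 1)
        if parts and parts[0].isdigit():
            ids.append(int(parts[0]))

print('entries:', len(ids))
print('highest user-id:', max(ids))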

2) the results from my diary slurp:
http://k5.semantic-db.org/diary-slurp/

2.1) the script I used for this (heh, probably not much use now, though it shows how I converted k5 HTML into sw format):
https://github.com/GarryMorrison/Feynman-knowledge-engine/blob/master/create-sw-file-tools/slurp-diaries.py

2.2) in particular, the full set of k5 diaries (as of 2015-7-22), in nested-html form (1.3 GB):
http://k5.semantic-db.org/diary-slurp/161942--archive-diaries--html-diaries--nested-format.zip

2.3) these diaries converted to my sw format (which is mostly useful for me, though it shouldn't be hard to load into a real database) (443 MB):
http://k5.semantic-db.org/diary-slurp/full-k5--all-k5-diaries-in-sw-format.zip
And sw format is particularly easy to grep and sed through to find what you want, since each learn rule is on one line (eg, newlines in posts have been escaped).

Now, a brief explanation of the sw format (as used in these files). Learn rules take the form:
some-operator |some ket> => |another ket>
some-operator |some ket> => |ket 1> + |ket 2> + |ket 3> + ... + |ket n>

For example:
url |diary: 2006-2-4-92049-12395> => |url: http://www.kuro5hin.org/story/2006/2/4/92049/12395>
author |diary: 2006-2-4-92049-12395> => |kuron: gr3y>
title |diary: 2006-2-4-92049-12395> => |text: Behold the true face of Islam:>
child-comment |diary: 2006-2-4-92049-12395> => |comment: 2006-2-4-92049-12395-1> + |comment: 2006-2-4-92049-12395-2> + ...
author |comment: 2006-2-4-92049-12395-2> => |kuron: Lemon Juice>
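
If you'd rather work with this outside of my tools, parsing those learn rules only takes a few lines of Python. A rough sketch (it assumes every rule really is on one line, and skips anything not in the "op |ket> => ..." form; the per-diary filename is made up for the example, and for the full 2.1G file you would want to filter lines rather than load everything into memory):

# parse-sw.py -- minimal sketch of an sw learn-rule parser
from collections import defaultdict

rules = defaultdict(dict)   # rules[ket][operator] = list of value kets

with open('2006-2-4-92049-12395.sw') as f:   # hypothetical per-diary sw file
    for line in f:
        if ' => ' not in line:
            continue
        lhs, rhs = line.rstrip('\n').split(' => ', 1)
        if ' |' not in lhs:
            continue
        op, ket = lhs.split(' |', 1)
        rules[ket.rstrip('>')][op] = [k.strip(' |>') for k in rhs.split(' + ')]

# eg, who wrote a given comment?
print(rules['comment: 2006-2-4-92049-12395-2'].get('author'))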

Here are the stats for the full-k5.sw file:
$ ./spit-out-sw-stats.sh full-k5.sw
326e7a1af84da55e13ebcf328b166a398f3f0e59 *full-k5.sw
(2.1G, 18 op types and 868715 learn rules)
author: 96300 learn rules
body: 1204182 learn rules
body-wc: 85594 learn rules
child-comment: 82990 learn rules
date: 85594 learn rules
date-time: 1204182 learn rules
how-many-comments: 85594 learn rules
intro: 85594 learn rules
intro-wc: 85594 learn rules
is-top-level-comment: 1118588 learn rules
parent-comment: 696901 learn rules
parent-diary: 1118589 learn rules
tags: 53002 learn rules
title: 1204182 learn rules
total-wc: 85594 learn rules
url: 1204182 learn rules
wc: 1118588 learn rules

Bah! There is something wrong with that script, since several of those per-operator counts are on their own bigger than the claimed 868715 total. This is the true number of learn rules:
$ grep -c " => " full-k5.sw
17123667
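
Until that script is fixed, here is a rough Python sketch that recounts learn rules per operator (the same idea as the grep, just bucketed by operator name):

# sw-op-counts.py -- recount learn rules per operator
from collections import Counter

counts = Counter()
with open('full-k5.sw') as f:
    for line in f:
        if ' => ' in line:
            counts[line.split(' |', 1)[0]] += 1

print(sum(counts.values()), 'learn rules,', len(counts), 'op types')
for op, n in sorted(counts.items()):
    print('%s: %s learn rules' % (op, n))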

Using the sw data, we can find things like:

comments per year: [graph]

diaries per year: [graph]

number of comments per diary: [graph]
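
For anyone who wants to reproduce the comments-per-year graph from the raw data: the year is baked into each comment's ket name (eg |comment: 2006-2-4-92049-12395-2>), and the stats above say there is one url rule per comment (1204182 = 85594 diaries + 1118588 comments). So a rough sketch is:

# comments-per-year.py -- bucket comments by the year prefix of their ket name
from collections import Counter

years = Counter()
with open('full-k5.sw') as f:
    for line in f:
        if line.startswith('url |comment: '):
            years[line[len('url |comment: '):].split('-', 1)[0]] += 1

for year in sorted(years):
    print(year, years[year])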

OK. Let's dig into the details of that last graph, and use mumble lang to produce a table showing the diaries with the most comments:


$ ./the_semantic_db_console.py
Welcome!

sa: load full-k5--how-many-comments.sw
sa: rank-table[diary,how-many-comments] select[1,30] reverse sort-by[how-many-comments] rel-kets[how-many-comments]
+------+----------------------+-------------------+
| rank | diary                | how-many-comments |
+------+----------------------+-------------------+
| 1    | 2002-5-7-202722-0453 | 835               |
| 2    | 2006-4-21-205756-081 | 356               |
| 3    | 2003-1-1-16127-40735 | 296               |
| 4    | 2004-5-18-13843-3909 | 276               |
| 5    | 2002-5-9-165212-3476 | 248               |
| 6    | 2004-9-7-41313-07421 | 244               |
| 7    | 2001-7-12-12343-2481 | 235               |
| 8    | 2002-4-26-142517-238 | 233               |
| 9    | 2006-1-5-16385-33113 | 223               |
| 10   | 2012-12-17-1649-9844 | 220               |
| 11   | 2002-4-23-104451-644 | 219               |
| 12   | 2003-3-19-8251-32475 | 211               |
| 13   | 2005-11-30-18310-255 | 210               |
| 14   | 2007-6-29-134551-001 | 207               |
| 15   | 2005-8-18-55631-0880 | 204               |
| 16   | 2005-12-30-13163-104 | 199               |
| 17   | 2002-4-25-93534-4989 | 198               |
| 18   | 2003-2-9-162755-1691 | 198               |
| 19   | 2005-12-28-14626-619 | 198               |
| 20   | 2007-5-7-103231-5181 | 198               |
| 21   | 2003-11-15-201021-60 | 193               |
| 22   | 2002-5-14-144525-460 | 189               |
| 23   | 2003-6-1-19331-10814 | 187               |
| 24   | 2006-3-7-171540-0259 | 186               |
| 25   | 2001-5-30-3634-11882 | 185               |
| 26   | 2003-4-25-21500-5428 | 185               |
| 27   | 2012-12-15-113011-53 | 184               |
| 28   | 2003-5-7-122415-5631 | 183               |
| 29   | 2005-1-14-172115-002 | 183               |
| 30   | 2003-1-31-880-29290  | 181               |
+------+----------------------+-------------------+
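
And if you don't have my console handy, you can get roughly the same table straight from the raw sw file. A sketch that counts comments per diary via the parent-diary rules (assuming they have the form parent-diary |comment: c> => |diary: d>, one rule per comment; I haven't double-checked the exact ket prefixes):

# top-diaries.py -- top 30 diaries by comment count, from the raw sw file
from collections import Counter

counts = Counter()
with open('full-k5.sw') as f:
    for line in f:
        if line.startswith('parent-diary |comment: '):
            counts[line.split(' => ', 1)[1].strip().strip('|>')] += 1

for rank, (diary, n) in enumerate(counts.most_common(30), 1):
    print(rank, diary, n)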

Next, the results from my monthly stats are here:
http://k5.semantic-db.org/k5-stats/

For example, last month's stats:
http://k5.semantic-db.org/k5-stats/2016-03-27--results.sw

comments per month: [graph]

diaries per month: [graph]

Finally, the results from the first diary slurp:
http://k5.semantic-db.org/first-k5-slurp/

eg, things like "the official k5 dictionary":
http://k5.semantic-db.org/first-k5-slurp/the-official-k5-dictionary.txt

and the k5 frequency list:
http://k5.semantic-db.org/first-k5-slurp/the-k5-frequency-list.txt
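
If anyone wants to rebuild that from the full data, a frequency list is just a word count over post bodies. A rough sketch against the sw file (assuming the body rules carry the text directly on the right-hand side, like the title rules above):

# k5-frequency-list.py -- rough word-frequency list over post bodies
import re
from collections import Counter

words = Counter()
with open('full-k5.sw') as f:
    for line in f:
        if line.startswith('body |'):
            words.update(re.findall(r"[a-z']+", line.split(' => ', 1)[1].lower()))

for word, n in words.most_common(100):
    print(word, n)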

The full data from that run (which only included 6,000 diaries) is here:
http://k5.semantic-db.org/first-k5-slurp/corpus/

And that is pretty much all I have!

Though back at my old host I still have the Crawfish archive:
http://crawfish.k5-stats.org/

Enjoy!

ps. I haven't heard back from Sye, so I don't know how much she has. But I think she has some of the section stories (she borrowed my script).

